## To install this package with conda, run:
conda install -c anaconda biopython

or, with pip:

pip install biopython

##### PubMed is a search engine that provides access to millions of biomedical citations. Users can search for biomedical references free of charge, and for some articles the full text is also openly accessible.

##### There are two main options to consider:

"
- Accessing the database via their public API
- Using a package that does the above for you, e.g. Biopython

The Entrez Database a.k.a. the PubMed API

The PubMed API is called the Entrez Database. It’s a web service freely accessible, although there are some guidelines to follow (at the moment of this writing, they recommend not to post more than three requests per second).

There are in total 8 different functions, or e-utilities, which access the database in different ways. Most of the utilities will return XML data, although some of them have the option to return a more convenient JSON format.

In particular, the search API is available at the following URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi

If we want to search for the term fever, the URL we need is for example:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&retmode=json&retmax=20&sort=relevance&term=fever

The query string parameters used in this example:

- db=pubmed, to narrow the search down to the pubmed DB only
- retmode=json, to have a JSON string in response and not an XML
- retmax=20, to obtain 20 results
- sort=relevance, the results are sorted by relevance and not by added date which is the default ranking option on pubmed
- term=[your query], the URL-encoded query

This search session will provide a number of PubMed IDs (probably 20) corresponding to the top citations which match our query.

In order to get some more details about these citations, we can use the efetch utility, which takes one or more citation IDs as input. At the moment, the efetch utility does not return JSON, so XML is the only option to consider.

Given a list of citation IDs, the fetch operation can be built as follows:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&retmode=xml&id=ID1,ID2,...

At this point, the response will be an XML document to handle with e.g. minidom or another XML library. Note that we can query the efetch utility for multiple documents simply by separating their IDs with commas.

Overall, it’s relatively easy to create the appropriate request using libraries like urllib.request or, better, requests. The response can be parsed with the json module, or minidom in case of XML.

An even more convenient way to do the job is to use an existing library that does what we need for us. A good example is Biopython, a comprehensive package for biological computation in Python."

###### Full discussion:
# https://marcobonzanini.wordpress.com/2015/01/12/searching-pubmed-with-python/
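
The request flow described in the quoted passage can be sketched with the standard library alone. This is a minimal sketch, not the blog author's code: the helper names `build_esearch_url`, `build_efetch_url` and `extract_ids` are made up for illustration, `SAMPLE_RESPONSE` is a hand-written stand-in with placeholder IDs, and the `esearchresult` / `idlist` keys are assumed from the JSON that esearch returns with retmode=json. The sketch uses https:// because NCBI now serves the e-utilities over HTTPS.

```python
import json
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=20):
    """Build an esearch URL with the query parameters listed above."""
    params = {
        "db": "pubmed",
        "retmode": "json",
        "retmax": retmax,
        "sort": "relevance",
        "term": term,  # urlencode takes care of URL-encoding the query
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def build_efetch_url(id_list):
    """Build an efetch URL for one or more comma-separated citation IDs."""
    params = {"db": "pubmed", "retmode": "xml", "id": ",".join(id_list)}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


def extract_ids(response_text):
    """Pull the list of PubMed IDs out of an esearch JSON response body."""
    return json.loads(response_text)["esearchresult"]["idlist"]


# Hand-written placeholder response, not real PubMed data.
SAMPLE_RESPONSE = '{"esearchresult": {"count": "2", "idlist": ["111", "222"]}}'

if __name__ == "__main__":
    print(build_esearch_url("breast cancer"))
    ids = extract_ids(SAMPLE_RESPONSE)
    print(build_efetch_url(ids))
```

To run this against the live service, the URLs could be passed to `urllib.request.urlopen` or `requests.get`, keeping within the request-rate guideline mentioned in the quote.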


# Text mining

In [1]:
from Bio import Entrez
import json


def search(query):
    Entrez.email = 'shuzhenyu888@gmail.com'
    handle = Entrez.esearch(db='pubmed', 
                            sort='relevance', 
                            retmax='100',
                            retmode='xml', 
                            term=query)
    results = Entrez.read(handle)
    return results

In [2]:
def fetch_details(id_list):
    ids = ','.join(id_list)
    Entrez.email = 'shuzhenyu888@gmail.com'
    handle = Entrez.efetch(db='pubmed',
                           retmode='xml',
                           id=ids)
    results = Entrez.read(handle)
    return results
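
Because the guideline quoted earlier recommends no more than three requests per second, a long ID list is probably better fetched in batches with a short pause between calls. The following is a minimal sketch under that assumption: `chunked` and `fetch_in_batches` are hypothetical helpers, and the `batch_size` and `delay` defaults are illustrative values rather than NCBI recommendations. The `fetch` argument can be any callable that takes a list of IDs, such as `fetch_details`.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(id_list, fetch, batch_size=200, delay=0.34):
    """Call `fetch` once per batch of IDs, pausing `delay` seconds between calls.

    Returns a list with one result per batch, in order.
    """
    results = []
    for n, batch in enumerate(chunked(list(id_list), batch_size)):
        if n:  # no need to wait before the very first request
            time.sleep(delay)
        results.append(fetch(batch))
    return results


if __name__ == "__main__":
    # Stand-in for fetch_details so the sketch runs without network access.
    fake_fetch = lambda batch: ",".join(batch)
    print(fetch_in_batches(["1", "2", "3", "4", "5"], fake_fetch, batch_size=2, delay=0))
```

With the notebook's functions this would be called as `fetch_in_batches(id_list, fetch_details)`, giving one parsed result per batch.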

In [3]:
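# Search terms, joined into a single comma-separated query string for esearch
feature_list = ['breast cancer', 'diagnosis', 'cell nuclei', 'image', 'radius', 'area']
if __name__ == '__main__':
    results = search(', '.join(feature_list))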
    id_list = results['IdList']
    papers = fetch_details(id_list)
    
    with open('papers.json', 'w') as outfile:
        json.dump(papers, outfile)
#     for i, paper in enumerate(papers['PubmedArticle']): 
#         print("%d) %s" % (i+1, paper['MedlineCitation']['Article']['ArticleTitle']))
#         print(json.dumps(paper, indent=2, separators=(',', ':')))
    # Pretty print the first paper in full
    #import json
    #print(json.dumps(papers[0], indent=2, separators=(',', ':')))

In [4]:

if __name__ == '__main__':
    
    txt = ""
    abstract=[]
    results = search('breast cancer, nuclei , image,  malignant, benign')
    id_list = results['IdList']
    papers = fetch_details(id_list)
    for i, paper in enumerate(papers['PubmedArticle']): 
        #print("%d) %s" % (i+1, paper['MedlineCitation']['Article']['Abstract']["AbstractText"]))
    
        #abstract.append( paper['MedlineCitation']['Article']['Abstract']["AbstractText"])
        #print(abstract)
        paper_obj = {}
        
        try:
            #concatenate the abstract together
            paper_obj["title"] = paper['MedlineCitation']['Article']['ArticleTitle']
            paper_obj["abstract"] = paper['MedlineCitation']['Article']['Abstract']['AbstractText']
            #for j in paper['MedlineCitation']['Article']['Abstract']['AbstractText']:
                #txt += j+' '
            #paper_obj["abstract"] = txt
        except KeyError:
            pass 
        
        abstract.append(paper_obj)
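
The commented-out lines in this cell hint at concatenating the abstract sections into one string. A minimal sketch of that idea follows, assuming the parsed record behaves like nested dictionaries and lists, as the indexing above suggests. `extract_title_abstract` is a hypothetical helper, and the sample record is a hand-made placeholder, not PubMed data.

```python
def extract_title_abstract(paper):
    """Return the title and the abstract (all sections joined) of one record.

    Missing fields yield empty strings instead of raising KeyError.
    """
    article = paper.get("MedlineCitation", {}).get("Article", {})
    sections = article.get("Abstract", {}).get("AbstractText", [])
    if isinstance(sections, str):
        # A single unstructured abstract may arrive as one string.
        sections = [sections]
    return {
        "title": str(article.get("ArticleTitle", "")),
        "abstract": " ".join(str(section) for section in sections),
    }


if __name__ == "__main__":
    sample = {
        "MedlineCitation": {
            "Article": {
                "ArticleTitle": "Placeholder title",
                "Abstract": {"AbstractText": ["Background text.", "Results text."]},
            }
        }
    }
    print(extract_title_abstract(sample))
```

Joining the sections here would make the later step that keeps only the first list element unnecessary, at the cost of longer abstract strings.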
      
        

In [5]:
abstract[2]

{'abstract': ['Prompt and widely available diagnostics of breast cancer is crucial for the prognosis of patients. One of the diagnostic methods is the analysis of cytological material from the breast. This examination requires extensive knowledge and experience of the cytologist. Computer-aided diagnosis can speed up the diagnostic process and allow for large-scale screening. One of the largest challenges in the automatic analysis of cytological images is the segmentation of nuclei. In this study, four different clustering algorithms are tested and compared in the task of fast nuclei segmentation. K-means, fuzzy C-means, competitive learning neural networks and Gaussian mixture models were incorporated for clustering in the color space along with adaptive thresholding in grayscale. These methods were applied in a medical decision support system for breast cancer diagnosis, where the cases were classified as either benign or malignant. In the segmented nuclei, 42 morphological, topologi

In [6]:
import pandas as pd
BC_abstract = pd.DataFrame(abstract)
BC_abstract.head()

|   | abstract | title |
|---|---|---|
| 0 | [Histopathology is the clinical standard for t... | Rapid staining and imaging of subnuclear featu... |
| 1 | [Accurate detection of breast malignancy from ... | Nuclear nano-morphology markers of histologica... |
| 2 | [Prompt and widely available diagnostics of br... | Computer-aided diagnosis of breast cancer base... |
| 3 | [Cell nuclei classification in breast cancer h... | Breast cancer cell nuclei classification in hi... |
| 4 | [Grading schemes for breast cancer diagnosis a... | Isotropic 3D nuclear morphometry of normal, fi... |


In [7]:
# Where the abstract is a list of sections, keep only its first element
BC_abstract.update(BC_abstract.abstract[BC_abstract.abstract.apply(type) == list].str[0])


In [8]:
BC_abstract.head()

|   | abstract | title |
|---|---|---|
| 0 | Histopathology is the clinical standard for ti... | Rapid staining and imaging of subnuclear featu... |
| 1 | Accurate detection of breast malignancy from h... | Nuclear nano-morphology markers of histologica... |
| 2 | Prompt and widely available diagnostics of bre... | Computer-aided diagnosis of breast cancer base... |
| 3 | Cell nuclei classification in breast cancer hi... | Breast cancer cell nuclei classification in hi... |
| 4 | Grading schemes for breast cancer diagnosis ar... | Isotropic 3D nuclear morphometry of normal, fi... |


In [9]:
# Write the title/abstract table to CSV for the Spark step
BC_abstract.to_csv("paper.csv", index=False, sep=',', encoding='utf-8')

# NLP: TF-IDF

In [None]:
from pyspark import SparkContext
from pyspark import SparkConf

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import HashingTF, IDF, Tokenizer, StopWordsRemover

In [None]:
# Start a spark session
spark = SparkSession.builder.appName('stuf_idf').getOrCreate()

In [None]:
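# Read in csv; pandas writes embedded quotes doubled (""), so use '"' as Spark's escape character
dataframe = (spark.read.format("csv").option("header", "true")
             .option("escape", "\"").option("multiLine", "true")
             .load("paper.csv"))
dataframe.show()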

In [None]:
# Tokenize dataframe
tokened = Tokenizer(inputCol="abstract", outputCol="abstract_words" )
#tokened = Tokenizer(inputCol="title", outputCol="title_words" )
tokened_transformed = tokened.transform(dataframe)
tokened_transformed.show()

In [15]:
# Remove stop words
remover = StopWordsRemover(inputCol="abstract_words", outputCol="filtered")
removed_frame = remover.transform(tokened_transformed)
removed_frame.show(truncate=False)

In [16]:
# Run the hashing term frequency
hashing = HashingTF(inputCol="filtered", outputCol="hashedValues", numFeatures=pow(2,4))

# Transform into a DF
hashed_df = hashing.transform(removed_frame)
hashed_df.show()
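
With `numFeatures=pow(2,4)` there are only 16 buckets, so any abstract with more than 16 distinct words must map several words to the same index. The sketch below illustrates that effect in plain Python. It is not Spark's implementation: `zlib.crc32` is used as a stand-in hash (Spark's `HashingTF` uses its own hash function, so the actual indices will differ), and `bucket` and `hashed_term_counts` are hypothetical helper names. The tokens are taken from the sample abstract printed earlier.

```python
import zlib


def bucket(token, num_features):
    """Map a token to a bucket index (crc32 is a stand-in, not Spark's hash)."""
    return zlib.crc32(token.encode("utf-8")) % num_features


def hashed_term_counts(tokens, num_features):
    """Count how many tokens land in each bucket, like a hashed term-frequency vector."""
    counts = {}
    for token in tokens:
        idx = bucket(token, num_features)
        counts[idx] = counts.get(idx, 0) + 1
    return counts


# 20 distinct words from the sample abstract shown earlier.
tokens = [
    "prompt", "widely", "available", "diagnostics", "breast", "cancer",
    "crucial", "prognosis", "patients", "diagnostic", "methods", "analysis",
    "cytological", "material", "examination", "requires", "extensive",
    "knowledge", "experience", "cytologist",
]

if __name__ == "__main__":
    small = hashed_term_counts(tokens, 2 ** 4)
    large = hashed_term_counts(tokens, 2 ** 18)
    print("distinct tokens:", len(set(tokens)))
    print("buckets used with 16 features:", len(small))
    print("buckets used with 2**18 features:", len(large))
```

Leaving `numFeatures` at Spark's default of 2**18, or at least well above the vocabulary size, should keep most words in separate buckets.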

+--------------------+--------------------+--------------------+--------------------+--------------------+
|            abstract|               title|      abstract_words|            filtered|        hashedValues|
+--------------------+--------------------+--------------------+--------------------+--------------------+
|Histopathology is...|Rapid staining an...|[histopathology, ...|[histopathology, ...|(16,[1,2,3,4,5,6,...|
|"Accurate detecti...| such as high-ris...|["accurate, detec...|["accurate, detec...|(16,[0,1,2,3,4,5,...|
|Prompt and widely...|Computer-aided di...|[prompt, and, wid...|[prompt, widely, ...|(16,[0,1,2,3,4,5,...|
|Cell nuclei class...|Breast cancer cel...|[cell, nuclei, cl...|[cell, nuclei, cl...|(16,[0,1,2,3,4,6,...|
|Grading schemes f...|Isotropic 3D nucl...|[grading, schemes...|[grading, schemes...|(16,[0,1,2,4,5,6,...|
|Digital cytology ...|Multi-label fast ...|[digital, cytolog...|[digital, cytolog...|(16,[0,1,2,3,4,5,...|
|To study the disc...|Textural analys

In [17]:
# Fit the IDF on the data set 
idf = IDF(inputCol="hashedValues", outputCol="features")
idfModel = idf.fit(hashed_df)
rescaledData = idfModel.transform(hashed_df)

In [18]:
# Display the dataframe
rescaledData.select("abstract_words", "features").show(truncate=False)

##### Extract the weighted words from the title and abstract

In [53]:

title_list=BC_abstract['title'].tolist()

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(title_list)
# Sort by IDF, highest first: these are the rarest terms across the titles
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names_out().tolist()
top_n = 20
top_features = [features[i] for i in indices[:top_n]]
print(top_features)

['implementation', 'fibrocystic', 'glandular', 'gland', 'g2', 'fuzzy', 'fraction', 'fourth', 'four', 'fluids', 'field', 'features', 'growth', 'feasibility', 'fast', 'exudative', 'examination', 'erbb', 'effect', 'early']


In [55]:
abstract_list=BC_abstract['abstract'].tolist()

In [56]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(abstract_list)
# Sort by IDF, highest first: these are the rarest terms across the abstracts
indices = np.argsort(vectorizer.idf_)[::-1]
features = vectorizer.get_feature_names_out().tolist()
top_n = 20
top_features = [features[i] for i in indices[:top_n]]
print(top_features)

['increases', 'fisher', 'flagged', 'flattened', 'fluids', 'fluorescently', 'fna', 'focal', 'foci', 'following', 'follows', 'form', 'formaldehyde', 'formally', 'formed', 'forming', 'forty', 'fixation', 'first', 'multiple']


In [57]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(title_list)
idf = vectorizer.idf_
# Map each term to its IDF weight
print(dict(zip(vectorizer.get_feature_names_out().tolist(), idf.tolist())))

{'3d': 4.0445224377234226, '70': 4.0445224377234226, 'adjacent': 4.0445224377234226, 'after': 4.0445224377234226, 'ag': 4.0445224377234226, 'agnors': 4.0445224377234226, 'aided': 4.0445224377234226, 'alterations': 4.0445224377234226, 'an': 4.0445224377234226, 'analyses': 4.0445224377234226, 'analysis': 2.09861228866811, 'and': 1.6466271649250523, 'apocrine': 4.0445224377234226, 'appearance': 4.0445224377234226, 'appearing': 4.0445224377234226, 'application': 3.3513752571634776, 'approach': 3.6390573296152584, 'area': 4.0445224377234226, 'aspirates': 3.1282317058492679, 'aspiration': 3.3513752571634776, 'assessment': 4.0445224377234226, 'associated': 3.6390573296152584, 'at': 4.0445224377234226, 'atypical': 4.0445224377234226, 'automated': 2.9459101490553135, 'b1': 4.0445224377234226, 'balance': 4.0445224377234226, 'based': 3.6390573296152584, 'benign': 2.435084525289323, 'between': 4.0445224377234226, 'biopsy': 3.1282317058492679, 'borderline': 3.6390573296152584, 'breast': 1.126751705
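
The IDF values printed above can be reproduced by hand. With its default `smooth_idf=True`, scikit-learn computes idf = ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df the number of documents containing the term. The sketch below is a worked check of that formula. The value n = 41 is inferred from the printed numbers and is not stated anywhere in the notebook, and the document frequencies are likewise back-calculated assumptions.

```python
import math


def smooth_idf(n_docs, doc_freq):
    """idf = ln((1 + n) / (1 + df)) + 1, the smoothed IDF used by TfidfVectorizer."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1


N_DOCS = 41  # inferred from the printed values, not stated in the notebook

if __name__ == "__main__":
    # Document frequencies that appear consistent with the printed dictionary.
    for term, df in [("3d", 1), ("analysis", 13), ("and", 21)]:
        print(term, df, smooth_idf(N_DOCS, df))
```

A term that occurs in a single title gets the maximum weight, which is why sorting by `idf_` in descending order surfaces the rarest words rather than the most characteristic ones.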

# Extracting, transforming and selecting features, and saving them as CSV files

In [58]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix  # need this if you want to save tfidf_matrix

# input='content' (the default) means fit_transform receives the strings themselves
tf = TfidfVectorizer(input='content', analyzer='word', ngram_range=(1, 6),
                     min_df=1, stop_words='english', sublinear_tf=True)
tfidf_matrix = tf.fit_transform(title_list)

In [59]:
feature_names = tf.get_feature_names_out()

In [60]:
doc = 0
feature_index = tfidf_matrix[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfidf_matrix[doc, x] for x in feature_index])

In [61]:
title_w = []
title_s = []
for i, s in tfidf_scores:
    w = feature_names[i]
    print(w, s)
    title_w.append(w)
    title_s.append(s)

rapid 0.128322575483
staining 0.128322575483
imaging 0.128322575483
subnuclear 0.128322575483
features 0.128322575483
differentiate 0.128322575483
malignant 0.0689349766174
benign 0.0772591381592
breast 0.0357490119103
tissues 0.128322575483
point 0.128322575483
care 0.128322575483
setting 0.128322575483
rapid staining 0.128322575483
staining imaging 0.128322575483
imaging subnuclear 0.128322575483
subnuclear features 0.128322575483
features differentiate 0.128322575483
differentiate malignant 0.128322575483
malignant benign 0.128322575483
benign breast 0.128322575483
breast tissues 0.128322575483
tissues point 0.128322575483
point care 0.128322575483
care setting 0.128322575483
rapid staining imaging 0.128322575483
staining imaging subnuclear 0.128322575483
imaging subnuclear features 0.128322575483
subnuclear features differentiate 0.128322575483
features differentiate malignant 0.128322575483
differentiate malignant benign 0.128322575483
malignant benign breast 0.128322575483
benign
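
Most of the scores above are identical because `ngram_range=(1,6)` turns one short title into many n-grams, nearly all of which occur in that title only. The sketch below reproduces the n-gram expansion in plain Python. `word_ngrams` is a hypothetical helper that mimics, but is not, scikit-learn's analyzer, and the token list is copied from the unigrams printed above (stop words already removed).

```python
import math


def word_ngrams(tokens, min_n=1, max_n=6):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


title_tokens = (
    "rapid staining imaging subnuclear features differentiate "
    "malignant benign breast tissues point care setting"
).split()

if __name__ == "__main__":
    grams = word_ngrams(title_tokens)
    print(len(grams), "n-grams from", len(title_tokens), "tokens")
    # If every n-gram had the same weight, L2 normalisation would give each this score:
    print(1 / math.sqrt(len(grams)))
```

The uniform-weight value is close to the 0.128 reported for most n-grams. The reported value is slightly higher because a few common terms (malignant, benign, breast) receive lower weights.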

In [77]:
import pandas
T_df = pandas.DataFrame(data={"titel_words": title_w, "title_TFIDF score": title_s})
T_df=T_df.sort_values(['title_TFIDF score'], ascending=False)
T_df = T_df.reset_index(drop=True)
T_df

|   | titel_words | title_TFIDF score |
|---|---|---|
| 0 | rapid | 0.128323 |
| 1 | staining imaging subnuclear features different... | 0.128323 |
| 2 | tissues point care | 0.128323 |
| 3 | point care setting | 0.128323 |
| 4 | rapid staining imaging subnuclear | 0.128323 |
| 5 | staining imaging subnuclear features | 0.128323 |
| 6 | imaging subnuclear features differentiate | 0.128323 |
| 7 | subnuclear features differentiate malignant | 0.128323 |
| 8 | features differentiate malignant benign | 0.128323 |
| 9 | differentiate malignant benign breast | 0.128323 |


In [78]:
T_df.to_csv("title.csv", sep=',',index=False)

In [79]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
from scipy.sparse import csr_matrix  # need this if you want to save tfidf_matrix

# input='content' (the default) means fit_transform receives the strings themselves
tf = TfidfVectorizer(input='content', analyzer='word', ngram_range=(1, 6),
                     min_df=1, stop_words='english', sublinear_tf=True)
tfidf_matrix = tf.fit_transform(abstract_list)

In [80]:
feature_names = tf.get_feature_names_out()

In [81]:

doc = 0
feature_index = tfidf_matrix[doc,:].nonzero()[1]
tfidf_scores = zip(feature_index, [tfidf_matrix[doc, x] for x in feature_index])

In [82]:
abstract_w = []
abstract_s = []
for i, s in tfidf_scores:
    w = feature_names[i]
    print(w, s)
    abstract_w.append(w)
    abstract_s.append(s)

histopathology 0.0505993471873
clinical 0.0444776006793
standard 0.0444776006793
tissue 0.107080319083
diagnosis 0.0688426977259
requires 0.0549427954575
processing 0.0549427954575
laboratory 0.0610645419654
personnel 0.0610645419654
infrastructure 0.0610645419654
highly 0.0549427954575
trained 0.0549427954575
pathologist 0.0610645419654
diagnose 0.0549427954575
optical 0.0444776006793
microscopy 0.0753071241882
provide 0.0444776006793
real 0.0549427954575
time 0.0505993471873
used 0.0353261110328
inform 0.0610645419654
management 0.0505993471873
breast 0.0281217407997
cancer 0.0296689576309
goal 0.0549427954575
work 0.0549427954575
obtain 0.0505993471873
images 0.0353261110328
morphology 0.0444776006793
fluorescence 0.0549427954575
vital 0.0610645419654
fluorescent 0.0549427954575
stains 0.0549427954575
develop 0.0549427954575
strategy 0.0610645419654
segment 0.0610645419654
quantify 0.0610645419654
features 0.0278906593931
order 0.0549427954575
enable 0.0610645419654
automated 0.0383

In [91]:
A_df = pandas.DataFrame(data={"abstract_words": abstract_w, "abstract_TFIDF_score": abstract_s})

A_df=A_df.sort_values(['abstract_TFIDF_score'], ascending=False)
A_df = A_df.reset_index(drop=True)
A_df

|   | abstract_TFIDF_score | abstract_words |
|---|---|---|
| 0 | 0.107080 | tissue |
| 1 | 0.103391 | tissue diagnosis |
| 2 | 0.075307 | microscopy |
| 3 | 0.068843 | diagnosis |
| 4 | 0.061065 | histopathology clinical standard tissue |
| 5 | 0.061065 | enable automated tissue diagnosis |
| 6 | 0.061065 | diagnosis requires tissue processing laboratory |
| 7 | 0.061065 | tissue diagnosis requires tissue processing |
| 8 | 0.061065 | standard tissue diagnosis requires tissue |
| 9 | 0.061065 | clinical standard tissue diagnosis requires |


In [87]:
df_col_merged = pd.concat([A_df, T_df], axis=1)

In [88]:
df_col_merged

|   | abstract_TFIDF_score | abstract_words | titel_words | title_TFIDF score |
|---|---|---|---|---|
| 0 | 0.107080 | tissue | rapid | 0.128323 |
| 1 | 0.103391 | tissue diagnosis | staining imaging subnuclear features different... | 0.128323 |
| 2 | 0.075307 | microscopy | tissues point care | 0.128323 |
| 3 | 0.068843 | diagnosis | point care setting | 0.128323 |
| 4 | 0.061065 | histopathology clinical standard tissue | rapid staining imaging subnuclear | 0.128323 |
| 5 | 0.061065 | enable automated tissue diagnosis | staining imaging subnuclear features | 0.128323 |
| 6 | 0.061065 | diagnosis requires tissue processing laboratory | imaging subnuclear features differentiate | 0.128323 |
| 7 | 0.061065 | tissue diagnosis requires tissue processing | subnuclear features differentiate malignant | 0.128323 |
| 8 | 0.061065 | standard tissue diagnosis requires tissue | features differentiate malignant benign | 0.128323 |
| 9 | 0.061065 | clinical standard tissue diagnosis requires | differentiate malignant benign breast | 0.128323 |


In [95]:
df_col_merged.fillna(0, inplace=True)
df_col_merged

|   | abstract_TFIDF_score | abstract_words | titel_words | title_TFIDF score |
|---|---|---|---|---|
| 0 | 0.107080 | tissue | rapid | 0.128323 |
| 1 | 0.103391 | tissue diagnosis | staining imaging subnuclear features different... | 0.128323 |
| 2 | 0.075307 | microscopy | tissues point care | 0.128323 |
| 3 | 0.068843 | diagnosis | point care setting | 0.128323 |
| 4 | 0.061065 | histopathology clinical standard tissue | rapid staining imaging subnuclear | 0.128323 |
| 5 | 0.061065 | enable automated tissue diagnosis | staining imaging subnuclear features | 0.128323 |
| 6 | 0.061065 | diagnosis requires tissue processing laboratory | imaging subnuclear features differentiate | 0.128323 |
| 7 | 0.061065 | tissue diagnosis requires tissue processing | subnuclear features differentiate malignant | 0.128323 |
| 8 | 0.061065 | standard tissue diagnosis requires tissue | features differentiate malignant benign | 0.128323 |
| 9 | 0.061065 | clinical standard tissue diagnosis requires | differentiate malignant benign breast | 0.128323 |


In [96]:
df_col_merged.to_csv("title_abstract.csv", sep=',',index=False)

In [97]:
# Stop Spark
spark.stop()