Steps of an NLP pipeline that we can implement in our algorithm, pending further literature research:

1. Sentence segmentation: break the given paragraph into separate sentences.
2. Word tokenization: extract the words from each sentence one by one.
3. 'Parts of Speech' Prediction: identify the part of speech of each word.
4. Text Lemmatization: figure out the most basic form (lemma) of each word in a sentence. Without this step, "Germ" and "Germs" would be treated as two different words, and we should look to solve that.
5. 'Stop Words' Identification: English has a lot of filler words that appear very frequently and introduce a lot of noise.
6. Dependency Parsing: use the grammatical rules of the language to figure out how the words relate to one another.
7. Entity Analysis: go through the text and identify all of the important words or “entities” in the text.
8. Pronouns Parsing: keep track of what each pronoun refers to, given the context of the sentence.

## Step 1: Sentence segmentation

In [1]:
#pip install spacy
#spacy.cli.download("en_core_web_sm")

In [2]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
doc = nlp(u"While scanning the water for these hydrodynamic signals at a swimming speed in the order of meters per second, the seal keeps its long and flexible whiskers in an abducted position, largely perpendicular to the swimming direction. Remarkably, the whiskers of harbor seals possess a specialized undulated surface structure, the function of which was, up to now, unknown. Here, we show that this structure effectively changes the vortex street behind the whiskers and reduces the vibrations that would otherwise be induced by the shedding of vortices from the whiskers (vortex-induced vibrations). Using force measurements, flow measurements and numerical simulations, we find that the dynamic forces on harbor seal whiskers are, by at least an order of magnitude, lower than those on sea lion (Zalophus californianus) whiskers, which do not share the undulated structure. The results are discussed in the light of pinniped sensory biology and potential biomimetic applications.")

In [5]:
for sent in doc.sents:
    print(sent)

While scanning the water for these hydrodynamic signals at a swimming speed in the order of meters per second, the seal keeps its long and flexible whiskers in an abducted position, largely perpendicular to the swimming direction.
Remarkably, the whiskers of harbor seals possess a specialized undulated surface structure, the function of which was, up to now, unknown.
Here, we show that this structure effectively changes the vortex street behind the whiskers and reduces the vibrations that would otherwise be induced by the shedding of vortices from the whiskers (vortex-induced vibrations).
Using force measurements, flow measurements and numerical simulations, we find that the dynamic forces on harbor seal whiskers are, by at least an order of magnitude, lower than those on sea lion (Zalophus californianus) whiskers, which do not share the undulated structure.
The results are discussed in the light of pinniped sensory biology and potential biomimetic applications.


## Step 2: Word tokenization

In [6]:
# Read the abstract inside a context manager so the file is closed afterwards
with open('Abstract_textextraction.txt') as f:
    biomim = nlp(f.read())
words_biomim = [word.text for word in biomim]
print(words_biomim)

['While', 'scanning', 'the', 'water', 'for', 'these', 'hydrodynamic', 'signals', 'at', 'a', 'swimming', 'speed', 'in', 'the', 'order', 'of', 'meters', 'per', 'second', ',', 'the', 'seal', 'keeps', 'its', 'long', 'and', 'flexible', 'whiskers', 'in', 'an', 'abducted', 'position', ',', 'largely', 'perpendicular', 'to', 'the', 'swimming', 'direction', '.', 'Remarkably', ',', 'the', 'whiskers', 'of', 'harbor', 'seals', 'possess', 'a', 'specialized', 'undulated', 'surface', 'structure', ',', 'the', 'function', 'of', 'which', 'was', ',', 'up', 'to', 'now', ',', 'unknown', '.', 'Here', ',', 'we', 'show', 'that', 'this', 'structure', 'effectively', 'changes', 'the', 'vortex', 'street', 'behind', 'the', 'whiskers', 'and', 'reduces', 'the', 'vibrations', 'that', 'would', 'otherwise', 'be', 'induced', 'by', 'the', 'shedding', 'of', 'vortices', 'from', 'the', 'whiskers', '(', 'vortex', '-', 'induced', 'vibrations', ')', '.', 'Using', 'force', 'measurements', ',', 'flow', 'measurements', 'and', 'num

Nouns

In [7]:
print("Noun phrases:", [chunk.text for chunk in biomim.noun_chunks])

Noun phrases: ['the water', 'these hydrodynamic signals', 'a swimming speed', 'the order', 'meters', 'second', 'the seal', 'its long and flexible whiskers', 'an abducted position', 'the swimming direction', 'the whiskers', 'harbor seals', 'a specialized undulated surface structure', 'the function', 'which', 'we', 'this structure', 'the vortex street', 'the whiskers', 'the vibrations', 'that', 'the shedding', 'vortices', 'the whiskers', 'vortex-induced vibrations', 'force measurements', 'we', 'the dynamic forces', 'harbor seal whiskers', 'at least an order', 'magnitude', 'those', 'sea lion', 'Zalophus californianus) whiskers', 'which', 'the undulated structure', 'The results', 'the light', 'pinniped sensory biology', 'potential biomimetic applications']


Verbs

In [8]:
print("Verbs:", [token.lemma_ for token in biomim if token.pos_ == "VERB"])

Verbs: ['scan', 'keep', 'possess', 'undulate', 'show', 'change', 'reduce', 'induce', 'induce', 'use', 'find', 'share', 'undulate', 'discuss', 'pinnipe']


Named entities

In [9]:
for entity in biomim.ents:
    print(entity.text, entity.label_)

Zalophus PERSON


## Step 3: Parts-of-speech prediction and Step 8: Pronouns Parsing

In [10]:
for token in doc:
    # Print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_)

While --> SCONJ
scanning --> VERB
the --> DET
water --> NOUN
for --> ADP
these --> DET
hydrodynamic --> ADJ
signals --> NOUN
at --> ADP
a --> DET
swimming --> NOUN
speed --> NOUN
in --> ADP
the --> DET
order --> NOUN
of --> ADP
meters --> NOUN
per --> ADP
second --> NOUN
, --> PUNCT
the --> DET
seal --> NOUN
keeps --> VERB
its --> PRON
long --> ADJ
and --> CCONJ
flexible --> ADJ
whiskers --> NOUN
in --> ADP
an --> DET
abducted --> ADJ
position --> NOUN
, --> PUNCT
largely --> ADV
perpendicular --> ADJ
to --> ADP
the --> DET
swimming --> NOUN
direction --> NOUN
. --> PUNCT
Remarkably --> ADV
, --> PUNCT
the --> DET
whiskers --> NOUN
of --> ADP
harbor --> NOUN
seals --> NOUN
possess --> VERB
a --> DET
specialized --> ADJ
undulated --> VERB
surface --> NOUN
structure --> NOUN
, --> PUNCT
the --> DET
function --> NOUN
of --> ADP
which --> PRON
was --> AUX
, --> PUNCT
up --> ADP
to --> ADP
now --> ADV
, --> PUNCT
unknown --> ADJ
. --> PUNCT
Here --> ADV
, --> PUNCT
we --> PRON
show --> VERB

In [11]:
spacy.explain("PART")

'particle'

## Step 4: Text Lemmatization

In [12]:
#extract lemma for each token:
" ".join([token.lemma_ for token in doc])

'while scan the water for these hydrodynamic signal at a swimming speed in the order of meter per second , the seal keep its long and flexible whisker in an abducted position , largely perpendicular to the swimming direction . remarkably , the whisker of harbor seal possess a specialized undulate surface structure , the function of which be , up to now , unknown . here , we show that this structure effectively change the vortex street behind the whisker and reduce the vibration that would otherwise be induce by the shedding of vortex from the whisker ( vortex - induce vibration ) . use force measurement , flow measurement and numerical simulation , we find that the dynamic force on harbor seal whisker be , by at least an order of magnitude , low than those on sea lion ( zalophus californianus ) whisker , which do not share the undulate structure . the result be discuss in the light of pinnipe sensory biology and potential biomimetic application .'

## Step 5: 'Stop Words' Identification

In [13]:
for token in doc:
    print(token.text,token.is_stop)

While True
scanning False
the True
water False
for True
these True
hydrodynamic False
signals False
at True
a True
swimming False
speed False
in True
the True
order False
of True
meters False
per True
second False
, False
the True
seal False
keeps False
its True
long False
and True
flexible False
whiskers False
in True
an True
abducted False
position False
, False
largely False
perpendicular False
to True
the True
swimming False
direction False
. False
Remarkably False
, False
the True
whiskers False
of True
harbor False
seals False
possess False
a True
specialized False
undulated False
surface False
structure False
, False
the True
function False
of True
which True
was True
, False
up True
to True
now True
, False
unknown False
. False
Here True
, False
we True
show True
that True
this True
structure False
effectively False
changes False
the True
vortex False
street False
behind True
the True
whiskers False
and True
reduces False
the True
vibrations False
that True
would True
otherwis

In [14]:
# If we want to remove stop words:
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

text = 'While scanning the water for these hydrodynamic signals at a swimming speed in the order of meters per second, the seal keeps its long and flexible whiskers in an abducted position, largely perpendicular to the swimming direction. Remarkably, the whiskers of harbor seals possess a specialized undulated surface structure, the function of which was, up to now, unknown. Here, we show that this structure effectively changes the vortex street behind the whiskers and reduces the vibrations that would otherwise be induced by the shedding of vortices from the whiskers (vortex-induced vibrations). Using force measurements, flow measurements and numerical simulations, we find that the dynamic forces on harbor seal whiskers are, by at least an order of magnitude, lower than those on sea lion (Zalophus californianus) whiskers, which do not share the undulated structure. The results are discussed in the light of pinniped sensory biology and potential biomimetic applications.'
lst = []
for token in text.split():
    # Keep the token only if it is not present in the stop-word list
    if token.lower() not in stopwords:
        lst.append(token)

# Join items in the list
print("Original text  : ", text)
print("Text after removing stopwords  :   ", ' '.join(lst))

Original text  :  While scanning the water for these hydrodynamic signals at a swimming speed in the order of meters per second, the seal keeps its long and flexible whiskers in an abducted position, largely perpendicular to the swimming direction. Remarkably, the whiskers of harbor seals possess a specialized undulated surface structure, the function of which was, up to now, unknown. Here, we show that this structure effectively changes the vortex street behind the whiskers and reduces the vibrations that would otherwise be induced by the shedding of vortices from the whiskers (vortex-induced vibrations). Using force measurements, flow measurements and numerical simulations, we find that the dynamic forces on harbor seal whiskers are, by at least an order of magnitude, lower than those on sea lion (Zalophus californianus) whiskers, which do not share the undulated structure. The results are discussed in the light of pinniped sensory biology and potential biomimetic applications.
Text 

In [15]:
#Filtering stop words from text file:
en = spacy.load('en_core_web_sm')
stopwords = en.Defaults.stop_words

with open("Abstract_textextraction.txt") as f:
    text=f.read()
    
lst=[]
for token in text.split():
    if token.lower() not in stopwords:
        lst.append(token)

print('Original Text')        
print(text,'\n\n')

print('Text after removing stop words')
print(' '.join(lst))

Original Text
While scanning the water for these hydrodynamic signals at a swimming speed in the order of meters per second, the seal keeps its long and flexible whiskers in an abducted position, largely perpendicular to the swimming direction. Remarkably, the whiskers of harbor seals possess a specialized undulated surface structure, the function of which was, up to now, unknown. Here, we show that this structure effectively changes the vortex street behind the whiskers and reduces the vibrations that would otherwise be induced by the shedding of vortices from the whiskers (vortex-induced vibrations). Using force measurements, flow measurements and numerical simulations, we find that the dynamic forces on harbor seal whiskers are, by at least an order of magnitude, lower than those on sea lion (Zalophus californianus) whiskers, which do not share the undulated structure. The results are discussed in the light of pinniped sensory biology and potential biomimetic applications.

 


Text
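
Splitting on whitespace leaves punctuation attached to the words, so tokens such as `was,` or `now,` do not match the stop-word list and survive the filter. A minimal sketch of one way around this, assuming a tiny hand-picked stop-word set (`STOP_WORDS` and `remove_stop_words` are hypothetical names; the set stands in for `en.Defaults.stop_words`):

```python
import string

# Hypothetical miniature stop-word set, standing in for spaCy's list
STOP_WORDS = {"the", "of", "which", "was", "up", "to", "now"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words, ignoring surrounding punctuation when comparing."""
    kept = []
    for token in text.split():
        # Strip leading/trailing punctuation only for the comparison
        bare = token.strip(string.punctuation).lower()
        if bare and bare not in stop_words:
            kept.append(token)
    return " ".join(kept)

sample = "the function of which was, up to now, unknown."
print(remove_stop_words(sample))
```

Iterating over the spaCy `doc` and checking `token.is_stop`, as in the earlier cell, also avoids the problem because spaCy tokenizes punctuation separately.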

## Step 6: Dependency Parsing

In [16]:
for token in doc:
    print(token.text, "-->", token.dep_)

While --> mark
scanning --> advcl
the --> det
water --> dobj
for --> prep
these --> det
hydrodynamic --> amod
signals --> pobj
at --> prep
a --> det
swimming --> compound
speed --> pobj
in --> prep
the --> det
order --> pobj
of --> prep
meters --> pobj
per --> prep
second --> pobj
, --> punct
the --> det
seal --> nsubj
keeps --> ROOT
its --> poss
long --> amod
and --> cc
flexible --> conj
whiskers --> dobj
in --> prep
an --> det
abducted --> amod
position --> pobj
, --> punct
largely --> advmod
perpendicular --> amod
to --> prep
the --> det
swimming --> compound
direction --> pobj
. --> punct
Remarkably --> advmod
, --> punct
the --> det
whiskers --> nsubj
of --> prep
harbor --> compound
seals --> pobj
possess --> ROOT
a --> det
specialized --> amod
undulated --> amod
surface --> compound
structure --> dobj
, --> punct
the --> det
function --> nsubj
of --> prep
which --> pobj
was --> conj
, --> punct
up --> prep
to --> prep
now --> pcomp
, --> punct
unknown --> acomp
. --> punct
Here -

## Step 7: Entity Analysis

In [17]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Zalophus PERSON


## NN

In [18]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import base64
import string
import re
from collections import Counter
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
df = pd.read_json('golden.json')
df.head()

Unnamed: 0,paper,mesh_terms,venue_ids,venue_names,author_ids,author_names,reference_ids,title,abstract,isOpenAccess,...,doi,level1,level2,level3,isBiomimicry,url,mag_terms,species,absolute_relevancy,relative_relevancy
0,W2103410568,"[Anura, Nesting Behavior, Animals, Anura, Fema...",[V153317304],"[Europe PMC, Biology Letters, Weird Nature: An...","[A2346835213, A2098042950]","[Laura Dalgetty, Malcolm W. Kennedy]","[W2159311519, W2038086748, W2130285640, W22718...",Building a home from foam—túngara frog foam ne...,frogs that build foam nests floating on water ...,True,...,10.1098/RSBL.2009.0934,"[assemble/break_down_structure, protect_from_l...","[physically_assemble_structure, protect_from_n...","[protect_from_loss_of_liquids, protect_from_mi...",Y,https://royalsocietypublishing.org/doi/10.1098...,"[bubble nest, nest, mixing, bubble, phase, eng...","[engystomops pustulosus, tungara frog, frogs]","[0.015873015873015, 0.015873015873015, 0.03174...","[0.5, 0.5, 1.0]"
1,W2138292607,"[Chiroptera, Orientation, Sunlight, Animal Mig...",[V125754415],"[Europe PMC, Proceedings of the National Acade...","[A2132083079, A2425702268, A2552946098]","[Richard A. Holland, Ivailo M. Borissov, Björn...","[W1998502308, W1990263987, W1512621966, W20723...","A nocturnal mammal, the greater mouse-eared ba...",recent evidence suggests that bats can detect ...,True,...,10.1073/PNAS.0912477107,[sense_send_process_information],[sense_signals/environmental_cues],[sense_spatial_awareness/balance/orientation],Y,https://www.pnas.org/content/107/15/6941,"[sunset, earth s magnetic field, compass, noct...","[myotis myotis, mouse-eared bat, bird, birds, ...","[0.0, 0.0, 0.054545454545454, 0.05454545454545...","[0.0, 0.0, 1.0, 1.0, 0.33333333333333304, 0.66..."
2,W2005539166,"[Decapodiformes, Vision, Ocular, Animals, Deca...",[V20257348],"[The Journal of Experimental Biology, Current ...","[A2163942483, A3088803717]","[Christopher M. Talbot, Justin Marshall]","[W1512876644, W1893205716, W1976288139, W20527...",Polarization sensitivity in two species of cut...,the existence of polarization sensitivity (ps)...,True,...,10.1242/JEB.042937,[sense_send_process_information],[sense_signals/environmental_cues],"[sense_light_in_the_non-visible_spectrum, sens...",Y,https://jeb.biologists.org/content/213/19/3364,"[sepia mestus, optomotor response, cuttlefish,...","[drums, cephalopods, sepia plangon, cuttlefish]","[0.02127659574468, 0.02127659574468, 0.0212765...","[0.33333333333333304, 0.33333333333333304, 0.3..."
3,W2151557512,"[Archaea, Cellulase, Cellulase, Protein Struct...",[V64187185],[Nature Communications],"[A3187290035, A2189911855, A2659069087, A20418...","[Joel Edward Graham, Michael E. Clark, Dana C....","[W2123857752, W2140244239, W2141885858, W21256...",Identification and characterization of a multi...,archaea are microorganisms that use a wide ran...,True,...,10.1038/NCOMMS1373,"[protect_from_living/non-living_threats, chemi...","[chemically_break_down, protect_from_non-livin...","[chemically_break_down_organic_compounds, prot...",Y,https://www.nature.com/articles/ncomms1373,"[energy source, cellulase, archaea, cellulose,...",[],[],[]
4,W2160542693,[],[V164789166],"[Journal of Phycology, The Journal of Experime...","[A1790079306, A2557664461, A189902495, A214051...","[Patrick T. Martone, Diego A. Navarro, Carlos ...","[W1482765817, W2142255159, W175328287, W208251...",DIFFERENCES IN POLYSACCHARIDE STRUCTURE BETWEE...,the articulated coralline calliarthron cheilos...,False,...,10.1111/J.1529-8817.2010.00828.X,[manage_mechanical_forces],[manage_stress_strain],"[manage_shear, manage_stress/strain]",Y,https://onlinelibrary.wiley.com/doi/abs/10.111...,"[galactan, coralline algae, cell wall, polysac...",[calliarthron cheilosporioides],[0.012987012987012],[1.0]


In [19]:
df.isnull().sum()

paper                 0
mesh_terms            0
venue_ids             0
venue_names           0
author_ids            0
author_names          0
reference_ids         0
title                 0
abstract              0
isOpenAccess          0
fullDocLink           0
petalID               0
doi                   0
level1                0
level2                0
level3                0
isBiomimicry          0
url                   0
mag_terms             0
species               0
absolute_relevancy    0
relative_relevancy    0
dtype: int64
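
`isnull()` only counts missing values, and an empty list is not missing. The preview above shows empty `mesh_terms` and `species` lists, so those columns are less complete than the zeros suggest. A minimal sketch of counting empty lists, using two shortened rows copied by hand into plain dictionaries (the `records` sample and the helper name are illustrative, not the real dataset):

```python
# Two shortened rows copied by hand from the preview (illustrative only)
records = [
    {"paper": "W2151557512", "mesh_terms": ["Archaea", "Cellulase"], "species": []},
    {"paper": "W2160542693", "mesh_terms": [], "species": ["calliarthron cheilosporioides"]},
]

def count_empty_lists(rows, columns):
    """Return {column: number of rows whose value is an empty list}."""
    return {
        col: sum(1 for row in rows if isinstance(row[col], list) and not row[col])
        for col in columns
    }

empty_counts = count_empty_lists(records, ["mesh_terms", "species"])
print(empty_counts)
```

On the DataFrame itself, `df['species'].apply(len).eq(0).sum()` would give the same count for a single column.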

In [20]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.33, random_state=42)
print('Paper:', train['paper'].iloc[0])
print('Mesh Terms:', train['mesh_terms'].iloc[0])
print('Venue IDs:', train['venue_ids'].iloc[0])
print('Venue Names:', train['venue_names'].iloc[0])
print('Author Ids:', train['author_ids'].iloc[0])
print('Author Names:', train['author_names'].iloc[0])
print('Reference IDs:', train['reference_ids'].iloc[0])
print('Title:', train['title'].iloc[0])
print('Abstract:', train['abstract'].iloc[0])
print('Open Access:', train['isOpenAccess'].iloc[0])
print('Full Doc Link:', train['fullDocLink'].iloc[0])
print('PeTaL ID:', train['petalID'].iloc[0])
print('doi:', train['doi'].iloc[0])
print('Level 1:', train['level1'].iloc[0])
print('Level 2:', train['level2'].iloc[0])
print('Level 3:', train['level3'].iloc[0])
print('Biomimicry:', train['isBiomimicry'].iloc[0])
print('url:', train['url'].iloc[0])
print('Mag Terms:', train['mag_terms'].iloc[0])
print('Species:', train['species'].iloc[0])
print('Absolute Relevancy:', train['absolute_relevancy'].iloc[0])
print('Relative Relevancy:', train['relative_relevancy'].iloc[0])
print('Training Data Shape:', train.shape)
print('Testing Data Shape:', test.shape)

Paper: W3123539913
Mesh Terms: ['Osteomalacia', 'Bone and Bones', 'Fourier Analysis', 'Hardness', 'Humans', 'Viscosity']
Venue IDs: ['V181675524']
Venue Names: ['Journal of Biomechanics', 'Journal of biomechanics']
Author Ids: ['A2641623412', 'A2761443407', 'A3121516640', 'A1224923036', 'A2125483482', 'A2136707749', 'A2315771281', 'A2408635047']
Author Names: ['I. Hadjab', 'Delphine Farlay', 'Pierrick Crozier', 'Thierry Douillard', 'Georges Boivin', 'Jérôme Chevalier', 'Sylvain Meille', 'Hélène Follet']
Reference IDs: ['W2070921333', 'W1608624687', 'W2035738092', 'W2157930105', 'W2083220561', 'W2991488528', 'W3014444608', 'W2153941374', 'W35508769', 'W2052261273', 'W2094982451', 'W1992591531', 'W2086933933', 'W1989674774', 'W2073829204', 'W2789355112', 'W2007698435', 'W2034025773', 'W2066646366', 'W2787614952', 'W1559827526', 'W2008828669', 'W2096618451', 'W2018120503', 'W2159986807', 'W2271045053', 'W194645740', 'W2052388011', 'W2132551393', 'W2924019999', 'W1998631739', 'W2035965297'

In [21]:
import spacy
nlp = spacy.load('en_core_web_sm')
punctuations = string.punctuation
def cleanup_text(docs, logging=False):
    """Lemmatize and lowercase each document, dropping stop words and punctuation."""
    texts = []
    counter = 1
    for doc in docs:
        if counter % 1000 == 0 and logging:
            print("Processed %d out of %d documents." % (counter, len(docs)))
        counter += 1
        doc = nlp(doc, disable=['parser', 'ner'])
        # '-PRON-' was the pronoun lemma in spaCy v2; the check is a no-op in v3
        tokens = [tok.lemma_.lower().strip() for tok in doc if tok.lemma_ != '-PRON-']
        tokens = [tok for tok in tokens if tok not in stopwords and tok not in punctuations]
        tokens = ' '.join(tokens)
        texts.append(tokens)
    # An explicit dtype avoids the pandas warning raised for an empty Series
    return pd.Series(texts, dtype=object)
# 'mesh_terms' holds a list per paper, so test membership instead of equality.
# The abstract is assumed to be the text to clean ('paper' only holds an ID).
INFO_text = train[train['mesh_terms'].apply(lambda terms: 'Osteomalacia' in terms)]['abstract'].tolist()
IS_text = train[train['mesh_terms'].apply(lambda terms: 'Bone and Bones' in terms)]['abstract'].tolist()
INFO_clean = cleanup_text(INFO_text)
INFO_clean = ' '.join(INFO_clean).split()
IS_clean = cleanup_text(IS_text)
IS_clean = ' '.join(IS_clean).split()
INFO_counts = Counter(INFO_clean)
IS_counts = Counter(IS_clean)
INFO_common_words = [word[0] for word in INFO_counts.most_common(20)]
INFO_common_counts = [word[1] for word in INFO_counts.most_common(20)]

  return pd.Series(texts)


In [22]:
IS_common_words = [word[0] for word in IS_counts.most_common(20)]
IS_common_counts = [word[1] for word in IS_counts.most_common(20)]
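
`most_common(20)` is called twice per category to separate the words from their counts. A minimal sketch of doing this in one pass with `zip`, on a made-up token list (the tokens and variable names are illustrative):

```python
from collections import Counter

# Made-up tokens standing in for the cleaned text
tokens = "bone mineral bone hardness bone mineral".split()
counts = Counter(tokens)

top = counts.most_common(2)   # list of (word, count) pairs, most frequent first
# Unzip the pairs; the `if top` guard covers an empty counter
common_words, common_counts = (list(t) for t in zip(*top)) if top else ([], [])

print(common_words, common_counts)
```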

In [23]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
#from sklearn.feature_extraction.stop_words import ENGLISH_STOP_WORDS
from sklearn.feature_extraction import _stop_words
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from sklearn.preprocessing import MultiLabelBinarizer
import string
import re
import spacy
nlp = spacy.load("en_core_web_sm")
from spacy.lang.en import English
parser = English()

In [24]:
STOPLIST = set(stopwords.words('english'))
SYMBOLS = " ".join(string.punctuation).split(" ") + ["-", "...", "“", "”"]
class CleanTextTransformer(TransformerMixin):
    """Pipeline step that normalizes raw text before vectorization."""
    def transform(self, X, **transform_params):
        return [cleanText(text) for text in X]
    def fit(self, X, y=None, **fit_params):
        return self
    def get_params(self, deep=True):
        return {}

def cleanText(text):
    # Replace line breaks with spaces and lowercase the text
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    return text
def tokenizeText(sample):
    # Use the loaded `nlp` pipeline: a blank English() pipeline has no
    # lemmatizer in spaCy v3, so tok.lemma_ would be empty there
    tokens = nlp(sample)
    lemmas = []
    for tok in tokens:
        lemmas.append(tok.lemma_.lower().strip() if tok.lemma_ != "-PRON-" else tok.lower_)
    tokens = lemmas
    tokens = [tok for tok in tokens if tok not in STOPLIST]
    tokens = [tok for tok in tokens if tok not in SYMBOLS]
    return tokens

In [25]:
def printNMostInformative(vectorizer, clf, N):
    # get_feature_names() was removed in scikit-learn 1.2
    feature_names = vectorizer.get_feature_names_out()
    coefs_with_fns = sorted(zip(clf.coef_[0], feature_names))
    topClass1 = coefs_with_fns[:N]
    topClass2 = coefs_with_fns[:-(N + 1):-1]
    print("Class 1 best: ")
    for feat in topClass1:
        print(feat)
    print("Class 2 best: ")
    for feat in topClass2:
        print(feat)
vectorizer = CountVectorizer(tokenizer=tokenizeText, ngram_range=(1,1))
clf = LinearSVC()

pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer), ('clf', clf)])
# data: the abstracts are assumed to be the text to classify ('paper' only holds an ID)
train1 = train['abstract'].tolist()
labelsTrain1 = train['mesh_terms'].tolist()
test1 = test['abstract'].tolist()
labelsTest1 = test['mesh_terms'].tolist()
# train
pipe.fit(train1, labelsTrain1)

####
# test
preds = pipe.predict(test1)

####
print("accuracy:", accuracy_score(labelsTest1, preds))
print("Top 10 features used to predict: ")

printNMostInformative(vectorizer, clf, 10)
pipe = Pipeline([('cleanText', CleanTextTransformer()), ('vectorizer', vectorizer)])
# Fit the text-only pipeline to get the sparse document-term matrix
transform = pipe.fit_transform(train1)
vocab = vectorizer.get_feature_names_out()
for i in range(len(train1)):
    s = ""
    # Columns (vocabulary indices) and counts of the non-zero entries in row i
    indexIntoVocab = transform.indices[transform.indptr[i]:transform.indptr[i+1]]
    numOccurences = transform.data[transform.indptr[i]:transform.indptr[i+1]]
    for idx, num in zip(indexIntoVocab, numOccurences):
        s += str((vocab[idx], num))

  y = np.asarray(y)


ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead - the MultiLabelBinarizer transformer can convert to this format.
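
The error arises because every `mesh_terms` entry is a list, so each paper carries several labels at once, which `LinearSVC` does not accept directly. The labels first have to become a binary indicator matrix with one column per term. scikit-learn's `MultiLabelBinarizer` does this; the pure-Python sketch below only illustrates the resulting format, and the function name `binarize_labels` and the sample labels are made up for the example:

```python
def binarize_labels(label_lists):
    """Turn a list of label lists into (classes, indicator matrix)."""
    classes = sorted({label for labels in label_lists for label in labels})
    column = {label: j for j, label in enumerate(classes)}
    matrix = []
    for labels in label_lists:
        row = [0] * len(classes)
        for label in labels:
            row[column[label]] = 1   # mark every label the sample carries
        matrix.append(row)
    return classes, matrix

# Made-up labels: two labels, one label, and no label at all
sample_labels = [["Osteomalacia", "Hardness"], ["Hardness"], []]
classes, y = binarize_labels(sample_labels)
print(classes)
for row in y:
    print(row)
```

With the real data, the indicator matrix from `MultiLabelBinarizer().fit_transform(labelsTrain1)` could be passed to a multi-label wrapper such as `OneVsRestClassifier(LinearSVC())` in place of the bare classifier.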

In [None]:
from sklearn import metrics
print(metrics.classification_report(labelsTest1, preds, 
                                    target_names=df['mesh_terms'].unique()))

## Another NN method

In [None]:
import numpy as np
# placeholder/Session belong to the TF1 graph API, exposed in TensorFlow 2 via compat.v1
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

coefficients = np.array([[1.], [-10.], [25.]])

x = tf.placeholder(tf.float32, [3, 1])
w = tf.Variable(0, dtype=tf.float32)                 # Creating a variable w
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]            # w^2 - 10w + 25 for these coefficients

# Named train_op so it does not overwrite the `train` DataFrame defined earlier
train_op = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
session.run(w)    # Evaluates w; printing the result at this point gives zero
session.run(train_op, feed_dict={x: coefficients})

print("W after one iteration:", session.run(w))

for i in range(1000):
    session.run(train_op, feed_dict={x: coefficients})

print("W after 1000 iterations:", session.run(w))
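
With these coefficients the cost is $w^2 - 10w + 25 = (w - 5)^2$, so gradient descent should drive `w` towards 5. A minimal plain-Python sketch of the same update rule, with the gradient $2w - 10$ written out by hand (the function names are illustrative and not part of TensorFlow):

```python
def cost_gradient(w, a=1.0, b=-10.0):
    """Derivative of a*w**2 + b*w + c with respect to w."""
    return 2 * a * w + b

def gradient_descent(w, steps, learning_rate=0.01):
    """Apply w <- w - learning_rate * gradient for the given number of steps."""
    for _ in range(steps):
        w -= learning_rate * cost_gradient(w)
    return w

w_one = gradient_descent(0.0, steps=1)
w_final = gradient_descent(w_one, steps=1000)
print("W after one iteration:", w_one)
print("W after 1000 more iterations:", w_final)
```

Each step maps $w$ to $0.98\,w + 0.1$, whose fixed point is 5, which is why the value converges there.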

In [None]:
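import random

# Assumed setup (not defined in the original cell): learning rate, bias input
# and random initial weights for the two inputs and the bias
lr = 1
bias = 1
weights = [random.random(), random.random(), random.random()]

def Perceptron(input1, input2, output) :
    outputP = input1*weights[0]+input2*weights[1]+bias*weights[2]
    if outputP > 0 : #activation function (here Heaviside)
        outputP = 1
    else :
        outputP = 0
    # Move each weight in proportion to the error and its input
    error = output - outputP
    weights[0] += error * input1 * lr
    weights[1] += error * input2 * lr
    weights[2] += error * bias * lr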

In [None]:
for i in range(50) :
    Perceptron(1,1,1) #True or true
    Perceptron(1,0,1) #True or false
    Perceptron(0,1,1) #False or true
    Perceptron(0,0,0) #False or false

In [None]:
x = int(input())
y = int(input())
outputP = x*weights[0] + y*weights[1] + bias*weights[2]
if outputP > 0 : #activation function
    outputP = 1
else :
    outputP = 0
print(x, "or", y, "is : ", outputP)
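
The perceptron cells rely on the globals `weights`, `bias` and `lr`. A minimal self-contained sketch of the same learning rule, assuming zero initial weights and a learning rate of 1 (both choices are illustrative and not taken from the notebook), followed by a check against the OR truth table:

```python
def predict(weights, bias, input1, input2):
    """Heaviside activation applied to the weighted sum."""
    total = input1 * weights[0] + input2 * weights[1] + bias * weights[2]
    return 1 if total > 0 else 0

def train_or_gate(epochs=50, lr=1.0, bias=1.0):
    weights = [0.0, 0.0, 0.0]   # assumed starting point
    samples = [(1, 1, 1), (1, 0, 1), (0, 1, 1), (0, 0, 0)]   # (input1, input2, target)
    for _ in range(epochs):
        for input1, input2, target in samples:
            error = target - predict(weights, bias, input1, input2)
            weights[0] += error * input1 * lr
            weights[1] += error * input2 * lr
            weights[2] += error * bias * lr
    return weights

bias = 1.0
weights = train_or_gate(bias=bias)
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, "or", b, "is :", predict(weights, bias, a, b))
```

Keeping the weights local to `train_or_gate` makes the result reproducible, since nothing depends on what earlier cells left in the global namespace.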

In [None]:
outputP = 1/(1+np.exp(-outputP)) #sigmoid function (numpy is imported as np)