## Natural Language Processing with the spaCy Framework

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python. It's designed specifically for production use and helps you build applications that process and "understand" large volumes of text.


### Introduction to Information Extraction

Information Extraction (IE) is a core task in Natural Language Processing (NLP) and computational linguistics. It is widely used in tasks such as question answering, machine translation, entity extraction, event extraction, named entity linking, coreference resolution, and relation extraction.

We can broadly divide Information Extraction into two branches:

Traditional Information Extraction: the relations to be extracted are pre-defined.

Open Information Extraction: the relations are not pre-defined. The system is free to extract any relation it comes across while going through the text data.

### Different Approaches to Information Extraction

There are several approaches to performing information extraction automatically. Let’s go through them one by one:

Rule-based Approach: We define a set of rules over the syntax and other grammatical properties of a natural language, and then use these rules to extract information from text.
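To make the rule-based idea concrete, here is a minimal sketch that uses only Python's `re` module rather than spaCy. The rule, the `capital_of` relation name, and the helper `extract_capitals` are all hypothetical and chosen for illustration; entity names are crudely approximated as runs of capitalized words.

```python
import re

# Hypothetical hand-written rule: "<X> is the capital of <Y>" -> (X, capital_of, Y).
# Entity names are approximated as one or more capitalized words.
NAME = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"
CAPITAL_RULE = re.compile(rf"({NAME}) is the capital of ({NAME})")


def extract_capitals(text):
    """Return (subject, relation, object) triples matched by the rule."""
    return [
        (m.group(1), "capital_of", m.group(2))
        for m in CAPITAL_RULE.finditer(text)
    ]


text = "Paris is the capital of France. New Delhi is the capital of India."
triples = extract_capitals(text)
print(triples)
```

A rule like this is precise but brittle: any sentence that phrases the same fact differently is missed, which is why real rule sets usually work on part-of-speech tags and dependency labels rather than raw strings.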

Supervised: Let’s say we have a sentence S that contains two entities, E1 and E2. The supervised machine learning model has to detect whether there is any relation (R) between E1 and E2. So, in a supervised approach, the task of relation extraction turns into the task of relation detection. The main drawback of this approach is that it needs a lot of labeled data to train a model.

Semi-supervised: When we don’t have enough labeled data, we can use a set of seed examples (triples) to formulate high-precision patterns that can be used to extract more relations from the text.
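The seed-based idea can be sketched in plain Python as a single bootstrapping round. Everything here is an assumption made for illustration: the toy corpus, the seed triples, and the helpers `induce_patterns` and `apply_patterns`. The text between a seed's subject and object becomes a pattern, and that pattern is then applied to the rest of the corpus.

```python
import re

# Toy corpus, made up for illustration.
corpus = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Tokyo, the capital of Japan, hosted the summit.",
    "Canberra, the capital of Australia, is an inland city.",
]

# Seed triples (subject, relation, object) that we already trust.
seeds = [("Paris", "capital_of", "France"), ("Tokyo", "capital_of", "Japan")]

# A single capitalized word stands in for an entity mention.
WORD = r"([A-Z][a-z]+)"


def induce_patterns(corpus, seeds):
    """Turn the text between a seed's subject and object into a regex."""
    patterns = set()
    for subj, _, obj in seeds:
        for sentence in corpus:
            m = re.search(re.escape(subj) + r"(.+?)" + re.escape(obj), sentence)
            if m:
                patterns.add(WORD + re.escape(m.group(1)) + WORD)
    return patterns


def apply_patterns(corpus, patterns, relation):
    """Collect every (subject, relation, object) matched by any pattern."""
    found = set()
    for sentence in corpus:
        for pattern in patterns:
            for m in re.finditer(pattern, sentence):
                found.add((m.group(1), relation, m.group(2)))
    return found


patterns = induce_patterns(corpus, seeds)
new_triples = apply_patterns(corpus, patterns, "capital_of") - set(seeds)
print(sorted(new_triples))
```

In a fuller system the newly found triples would be scored, and only the confident ones fed back in as seeds for the next round.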



In [1]:
import spacy 

nlp = spacy.load('en_core_web_sm')

In [2]:
doc = nlp("Fiction's about what it is to be a human being.")

In [3]:
[token.text for token in doc]

['Fiction',
 "'s",
 'about',
 'what',
 'it',
 'is',
 'to',
 'be',
 'a',
 'human',
 'being',
 '.']

In [6]:
# part-of-speech tags
[token.pos_ for token in doc]

['PROPN',
 'AUX',
 'ADP',
 'PRON',
 'PRON',
 'AUX',
 'PART',
 'AUX',
 'DET',
 'ADJ',
 'NOUN',
 'PUNCT']

In [8]:
# fine-grained part-of-speech tags
[token.tag_ for token in doc]

['NNP', 'VBZ', 'IN', 'WP', 'PRP', 'VBZ', 'TO', 'VB', 'DT', 'JJ', 'NN', '.']

In [10]:
# syntactic dependency
[token.dep_ for token in doc]

['nsubj',
 'ROOT',
 'prep',
 'attr',
 'nsubj',
 'pcomp',
 'aux',
 'xcomp',
 'det',
 'amod',
 'attr',
 'punct']

In [12]:
# syntactic head token
[token.head.text for token in doc]

["'s",
 "'s",
 "'s",
 'is',
 'is',
 'about',
 'be',
 'is',
 'being',
 'being',
 'be',
 "'s"]

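The token, dependency, and head lists are easier to read side by side. The sketch below does not call spaCy; it copies the three outputs shown above into plain lists and zips them into arcs. Matching heads by text only works here because every head word is unique in this sentence; with a real `Doc` you would compare token indices (`token.head.i`) instead.

```python
# Values copied from the notebook outputs above.
tokens = ["Fiction", "'s", "about", "what", "it", "is",
          "to", "be", "a", "human", "being", "."]
deps = ["nsubj", "ROOT", "prep", "attr", "nsubj", "pcomp",
        "aux", "xcomp", "det", "amod", "attr", "punct"]
heads = ["'s", "'s", "'s", "is", "is", "about",
         "be", "is", "being", "being", "be", "'s"]

# Each arc reads: dependent --label--> head. The ROOT token is its own head.
arcs = list(zip(tokens, deps, heads))
for token, dep, head in arcs:
    print(f"{token:>8} --{dep}--> {head}")

# The root of the parse, and the tokens attached directly to it.
root = [token for token, dep, _ in arcs if dep == "ROOT"]
children_of_root = [
    token for token, dep, head in arcs
    if head == root[0] and dep != "ROOT"
]
print(root, children_of_root)
```

Reading the arcs this way shows that the contracted verb `'s` is the root of the sentence, with `Fiction` as its subject and `about` opening the prepositional phrase.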
In [17]:
# Named Entities
doc = nlp("This is so American, man: either make something your God and cosmos and then worship it, or else kill it. Fiction's about what it is to be a human being.")
[(ent.text, ent.label_) for ent in doc.ents]

[('American', 'NORP')]

In [18]:
# sentences
[sent.text for sent in doc.sents]

['This is so American, man: either make something your God and cosmos and then worship it, or else kill it.',
 "Fiction's about what it is to be a human being."]

In [19]:
# Base noun phrases
[chunk.text for chunk in doc.noun_chunks]

['something',
 'your God',
 'cosmos',
 'it',
 'it',
 'Fiction',
 'what',
 'it',
 'a human being']

In [20]:
# label explainer
spacy.explain("NN")

'noun, singular or mass'

In [22]:
# Visualizing
from spacy import displacy

In [24]:
displacy.render(doc, style='dep')

In [25]:
# visualize named entities
displacy.render(doc, style='ent')

In [26]:
spacy.explain("NORP")

'Nationalities or religious or political groups'

In [27]:
# word vector similarity
doc1 = nlp("Tv’s “real” agenda is to be “liked,” because if you like what you’re seeing, you’ll stay tuned.")
doc2 = nlp("Tv is completely unabashed about this; it’s its sole raison.")

doc1.similarity(doc2)

0.5375663421616256

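By default, `Doc.similarity` estimates similarity as the cosine of the angle between the two documents' vectors, where a document vector is the average of its token vectors. The sketch below reimplements that measure in plain Python on made-up three-dimensional vectors; the helper name `cosine_similarity` and the toy vectors are assumptions for illustration, not spaCy code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat the similarity as 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, made up for illustration.
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]  # same direction as vec_a, so the score is close to 1
vec_c = [0.0, 0.0, 3.0]  # orthogonal to vec_a, so the score is 0

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Because the score depends entirely on the quality of the vectors, results from a small model such as `en_core_web_sm`, which as far as we know does not ship full word vectors, should be read with caution.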
In [28]:
# accessing word vectors
doc1.vector

array([ 0.35990968,  1.1010945 , -1.2016281 , -0.90050834,  2.1246927 ,
        0.12424658,  0.73570794,  0.45909515,  1.3454008 ,  0.44048792,
       -0.01104516, -1.3679193 ,  0.30202174,  0.08751535,  0.09242883,
       -0.14963065, -1.3951433 ,  0.0831148 , -0.44576877,  0.18858941,
       -0.14316843, -0.19358027, -0.5747817 ,  0.03631843, -0.01988212,
        0.7350926 ,  0.21418545, -0.44301704,  0.20097856,  0.42388466,
        0.47324508,  0.6995118 ,  0.3891565 ,  0.3508177 ,  0.61550367,
       -0.52222204, -0.35592186, -0.62786376, -0.8524878 , -0.7959564 ,
        0.60448796,  0.33411357, -0.6603433 , -0.75231504,  0.13181128,
       -0.542978  ,  0.3898444 , -0.5002389 , -1.4059976 , -0.38648227,
        0.7665743 ,  0.43114832,  0.06416342,  0.81875336, -0.5415367 ,
        0.1467485 ,  0.70429045, -0.5960693 ,  0.09859467,  1.4059186 ,
        1.1460891 , -0.83116317, -0.7075897 ,  0.2859249 , -0.09580757,
       -0.39038202, -0.14098622,  0.44758934,  0.21441807,  0.87

In [29]:
# pipeline components
nlp.pipe_names

['tagger', 'parser', 'ner']