## spaCy Language Processing Pipeline
A **pipeline** is a sequence of processing components. Calling 'nlp = spacy.blank("en")' creates a blank pipeline: text goes in, the tokenizer splits it into tokens, and any further components have to be added to the blank pipeline, as shown in the image below:

<img src = "img.jpg" width = "600px" height = "300px"></img>

So what are those components? They can be a 'tagger', a 'parser', named entity recognition ('ner'), and so on, as shown in the image below:

<img src = "img1.jpg" width = "600px" height = "300px"></img>
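To make the idea concrete, here is a toy sketch in plain Python of a tokenizer followed by a list of components. This is not how spaCy is implemented; the names 'tokenize', 'add_lengths', 'add_is_title' and 'run_pipeline' are hypothetical and exist only for illustration.

```python
from typing import Any, Callable, Dict, List

# A toy "doc" is just a dictionary that components annotate in turn.
ToyDoc = Dict[str, Any]
Component = Callable[[ToyDoc], ToyDoc]


def tokenize(text: str) -> ToyDoc:
    """Toy tokenizer: split on whitespace (spaCy's real tokenizer is far richer)."""
    return {"text": text, "tokens": text.split()}


def add_lengths(doc: ToyDoc) -> ToyDoc:
    """Hypothetical component: annotate every token with its length."""
    doc["lengths"] = [len(token) for token in doc["tokens"]]
    return doc


def add_is_title(doc: ToyDoc) -> ToyDoc:
    """Hypothetical component: mark tokens that start with a capital letter."""
    doc["is_title"] = [token.istitle() for token in doc["tokens"]]
    return doc


def run_pipeline(text: str, components: List[Component]) -> ToyDoc:
    """Tokenize first, then pass the doc through each component in order."""
    doc = tokenize(text)
    for component in components:
        doc = component(doc)
    return doc


# A "blank" pipeline has no components, so we only get tokens back.
blank_doc = run_pipeline("Captain america ate samosa", [])

# A pipeline with components returns the same tokens plus annotations.
full_doc = run_pipeline("Captain america ate samosa", [add_lengths, add_is_title])

print(blank_doc)
print(full_doc)
```

The order of the list matters: each component receives the doc as left by the previous one, which is the same idea as the ordered list of components in a real pipeline.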

In [1]:
# Let's import the spaCy library.
import spacy

In [3]:
# First create a blank spaCy pipeline, then pass some text through it:
nlp = spacy.blank("en")
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")
for token in doc:
    print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


In [4]:
# As we can see, the text was split into tokens, because a tokenizer is created by default even in a blank spaCy pipeline.
# If we list the components of this blank pipeline, the result is empty, because we have not added
# any component yet:
nlp.pipe_names

[]

In [5]:
# Now we'll use a pre-trained pipeline that comes with several components.
# The spaCy documentation lists the pre-trained pipelines available for each language. For English we installed one
# previously using the 'python -m spacy download en_core_web_sm' command.
# Once the pre-trained pipeline is downloaded, we load it instead of creating a blank one, and get a pipeline that has all
# the components.
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [7]:
# Or we can list the components together with their objects:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x1abd0056fa0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x1abcfaf6b80>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x1abd03dac80>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x1abd03604c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x1abd0367440>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x1abd03da510>)]

In [8]:
# With these components we can do much more:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)  # 'pos_' is the part of speech. 'lemma_' is the base form of the word.

Captain  |  PROPN  |  Captain
america  |  PROPN  |  america
ate  |  VERB  |  eat
100  |  NUM  |  100
$  |  NUM  |  $
of  |  ADP  |  of
samosa  |  PROPN  |  samosa
.  |  PUNCT  |  .
Then  |  ADV  |  then
he  |  PRON  |  he
said  |  VERB  |  say
I  |  PRON  |  I
can  |  AUX  |  can
do  |  VERB  |  do
this  |  PRON  |  this
all  |  DET  |  all
day  |  NOUN  |  day
.  |  PUNCT  |  .


In [13]:
# The 'tagger' component gave us 'pos_' and the 'lemmatizer' component gave us 'lemma_'.
# 'ner' performs named entity recognition.
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))  # 'spacy.explain' describes each label, e.g. ORG is an organization.

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 billion  |  MONEY  |  Monetary values, including unit


In [14]:
# So when we load a trained pipeline with all its components, we get a number of built-in features.
# These components together make up the language processing pipeline, as shown below:

<img src = "img2.jpg" width = "600px" height = "300px"></img>

* Pre-trained pipelines are listed in the spaCy documentation for some languages only. For the other languages, only the basic tokenizer is available.

In [15]:
# For better visualization we can use the 'displacy' module:
from spacy import displacy
displacy.render(doc, style="ent")

In [19]:
# With a blank pipeline we won't get 'pos_' or 'lemma_', because it has no components to produce them:
nlp = spacy.blank("en")
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for token in doc:
    print(token," | ", token.pos_, " | ", token.lemma_)

Tesla  |    |  
Inc  |    |  
is  |    |  
going  |    |  
to  |    |  
acquire  |    |  
twitter  |    |  
for  |    |  
$  |    |  
45  |    |  
billion  |    |  
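The empty columns above can also be checked programmatically. The sketch below assumes spaCy v3, where 'Doc.has_annotation' reports whether any token in the doc carries a given attribute; the variable names are only illustrative.

```python
import spacy

# A blank pipeline only tokenizes; it has no tagger or lemmatizer.
nlp = spacy.blank("en")
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")

# Ask the doc whether any token carries the given annotation.
has_pos = doc.has_annotation("POS")
has_lemma = doc.has_annotation("LEMMA")

# Equivalent manual check: every 'pos_' value is an empty string.
all_pos_empty = all(token.pos_ == "" for token in doc)

print("pipeline components:", nlp.pipe_names)
print("has POS annotation:", has_pos)
print("has LEMMA annotation:", has_lemma)
print("all pos_ values empty:", all_pos_empty)
```

A check like this is useful before relying on 'token.pos_' or 'token.lemma_' in later code, since a missing component produces empty strings rather than an error.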


In [20]:
# Sometimes we want only specific components in the pipeline rather than all of them. To do that, we first
# load the English pipeline, then create a blank pipeline, add 'ner' to that blank pipeline and specify the source
# it comes from. In other words: take the 'ner' component from the English pipeline and add it to the blank pipeline.

source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names   # Shows the added component, which is 'ner'.

['ner']

In [21]:
# So we can customize our blank pipeline and add custom components.
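
As a minimal sketch of a custom component, assuming spaCy v3: a function is registered with the '@Language.component' decorator and then added by name with 'nlp.add_pipe'. The component name 'token_counter' and the 'n_tokens' key are hypothetical choices made for this example.

```python
import spacy
from spacy.language import Language


# Register a function as a pipeline component under a name of our choosing.
# 'token_counter' is a hypothetical name, not a built-in spaCy component.
@Language.component("token_counter")
def token_counter(doc):
    # Store the number of tokens in the doc's free-form user_data dictionary.
    doc.user_data["n_tokens"] = len(doc)
    # A component must return the doc so the next component can use it.
    return doc


# Start from a blank pipeline (tokenizer only) and add the custom component.
nlp = spacy.blank("en")
nlp.add_pipe("token_counter")

doc = nlp("I can do this all day.")
print(nlp.pipe_names)
print(doc.user_data["n_tokens"])
```

Because the component is added by its registered name, it shows up in 'nlp.pipe_names' just like the built-in components.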