# Language Processing Pipeline in spaCy

In [30]:
# Import packages
import spacy

from spacy import displacy

In [2]:
nlp = spacy.blank("en")  # blank English pipeline: tokenizer only, no trained components

In [16]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day")

for token in doc:
    print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day


In [17]:
nlp.pipe_names

[]

In [18]:
nlp = spacy.load("en_core_web_sm")

In [19]:
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

1. `tok2vec`: This component converts tokens into vectors, which downstream components can use as input features.

2. `tagger`: The part-of-speech (POS) tagger assigns grammatical categories (like noun, verb, adjective) to each token.

3. `parser`: The dependency parser analyzes the grammatical structure of a sentence and assigns syntactic dependencies between words.

4. `attribute_ruler`: This component sets token attributes based on rule-based pattern matches, for example to handle exceptions that the statistical components get wrong.

5. `lemmatizer`: The lemmatizer reduces each word to its base (dictionary) form, e.g., "running" to "run".

6. `ner`: Named Entity Recognition (NER) identifies and classifies entities such as persons, organizations, and locations in the text.

In [20]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x23266117460>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x23266117040>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x23261cc8900>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x23263825e40>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x232660e7c80>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x2326340ad60>)]
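
The `(name, component)` pairs listed by `nlp.pipeline` hint at how processing works: the text is tokenized first, and each component is then called on the document in order. The following is a toy, pure-Python sketch of that idea, not spaCy's implementation; `ToyDoc`, `ToyPipeline`, and the two components are hypothetical names invented for illustration.

```python
# Toy illustration only: every name here is hypothetical and not part of spaCy.
from dataclasses import dataclass, field


@dataclass
class ToyDoc:
    text: str
    tokens: list = field(default_factory=list)
    annotations: dict = field(default_factory=dict)


def toy_tokenizer(text):
    """Split on whitespace and wrap the result in a ToyDoc."""
    return ToyDoc(text=text, tokens=text.split())


def toy_lowercaser(doc):
    """Component: store a lowercase form for every token."""
    doc.annotations["lower"] = [t.lower() for t in doc.tokens]
    return doc


def toy_capitalized_finder(doc):
    """Component: collect tokens that start with an uppercase letter."""
    doc.annotations["capitalized"] = [t for t in doc.tokens if t[:1].isupper()]
    return doc


class ToyPipeline:
    def __init__(self):
        self.pipeline = []  # ordered list of (name, component) pairs

    def add_pipe(self, name, component):
        self.pipeline.append((name, component))

    @property
    def pipe_names(self):
        return [name for name, _ in self.pipeline]

    def __call__(self, text):
        doc = toy_tokenizer(text)            # tokenization always runs first
        for _name, component in self.pipeline:
            doc = component(doc)             # each component receives and returns the doc
        return doc


toy_nlp = ToyPipeline()
toy_nlp.add_pipe("lowercaser", toy_lowercaser)
toy_nlp.add_pipe("capitalized_finder", toy_capitalized_finder)

toy_doc = toy_nlp("Captain america ate samosa")
print(toy_nlp.pipe_names)
print(toy_doc.annotations)
```

With no components added, the sketch still tokenizes the text, which mirrors why a blank pipeline can print tokens but produces no tags, lemmas, or entities.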

In [21]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day")

for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Captain  |  PROPN  |  Captain
america  |  PROPN  |  america
ate  |  VERB  |  eat
100  |  NUM  |  100
$  |  NUM  |  $
of  |  ADP  |  of
samosa  |  PROPN  |  samosa
.  |  PUNCT  |  .
Then  |  ADV  |  then
he  |  PRON  |  he
said  |  VERB  |  say
I  |  PRON  |  I
can  |  AUX  |  can
do  |  VERB  |  do
this  |  PRON  |  this
all  |  DET  |  all
day  |  NOUN  |  day


In [29]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
# Print each named entity, its label, and an explanation of the label
for ent in doc.ents:  # doc.ents holds the named entities found by the ner component
    print(f"{ent.text} | {ent.label_}  | {spacy.explain(ent.label_)} ")

Tesla Inc | ORG  | Companies, agencies, institutions, etc. 
$45 billion | MONEY  | Monetary values, including unit 


###### displacy

The call below renders a visualization of the named entities in the text together with their labels. Entities are highlighted in different colors according to their type.

In [32]:
displacy.render(doc, style='ent')
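
Outside a notebook, `displacy.render` can return the markup as a string when `jupyter=False` is passed. As a rough sketch, the example below uses displaCy's manual mode, which takes a dictionary of text and character offsets instead of a `Doc`, so no trained model is needed. The entity spans are copied from the NER output above, and the helper `char_span` is a hypothetical name introduced for this example.

```python
from spacy import displacy

text = "Tesla Inc is going to acquire twitter for $45 billion"


def char_span(phrase, label):
    """Hypothetical helper: build a manual-mode entity from a phrase in `text`."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}


# Manual-mode input: the raw text plus character offsets and labels
example = {
    "text": text,
    "ents": [char_span("Tesla Inc", "ORG"), char_span("$45 billion", "MONEY")],
    "title": None,
}

# jupyter=False makes render() return the HTML instead of displaying it
html = displacy.render(example, style="ent", manual=True, jupyter=False)
print(html)
```

The returned string can be written to an `.html` file and opened in a browser.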

In [38]:
# French-language pipeline
nlp = spacy.load("fr_core_news_sm")
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")

for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [52]:
# Blank pipeline (no trained components)
nlp = spacy.blank("en")
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")

for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Nothing is printed: a blank pipeline has no `ner` component, so `doc.ents` is empty.

###### Adding a component from a trained pipeline

In [53]:
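# Load a trained pipeline to copy a component from, and start from a blank one
source_nlp = spacy.load("en_core_web_sm")
nlp = spacy.blank("en")

# Copy the trained `ner` component into the blank pipeline
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names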

['ner']

In [56]:
# French input: the sourced `ner` was trained on English, so treat these predictions with caution
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
$45 milliards de dollars  |  MONEY  |  Monetary values, including unit


In [57]:
doc = nlp("Tesla Inc is going to acquire  Twitter for $45")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
Twitter  |  PRODUCT  |  Objects, vehicles, foods, etc. (not services)
45  |  MONEY  |  Monetary values, including unit
