<h2 align="center">spaCy Language Processing Pipelines Tutorial</h2>

<h3>Blank nlp pipeline</h3>

In [31]:
import spacy

nlp = spacy.blank("en")

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token)

Captain
america
ate
100
$
of
samosa
.
Then
he
said
I
can
do
this
all
day
.


The output above contains only tokens because this is a blank pipeline, as shown below. A pipeline is the sequence of components that runs after the tokenizer, drawn as the dotted rectangle in the image below. The rectangle is empty, hence the name blank pipeline.

<img height=300 width=400 src="https://github.com/jalalrahmanov/CodeBasics-nlp-tutorials-/blob/main/5_spacy_lang_processing_pipeline/spacy_blank_pipeline.jpg?raw=1" />

In [32]:
nlp.pipe_names

[]

nlp.pipe_names is an empty list, which indicates that the pipeline has no components. The tokenizer always runs first and is not listed among the components.
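To make the split between tokenizer and components concrete, here is a minimal sketch on a blank English pipeline. It only assumes that spaCy v3 is installed; no trained model is needed.

```python
import spacy

nlp = spacy.blank("en")

# The tokenizer is stored on the Language object itself,
# separately from the list of pipeline components.
tokenizer_type = type(nlp.tokenizer).__name__
print(tokenizer_type)
print(nlp.pipe_names)

# Tokenization still works even though the component list is empty.
doc = nlp("Captain america ate 100$ of samosa.")
tokens = [token.text for token in doc]
print(tokens)
```

The tokenizer turns the text into a Doc, and every component added later receives that Doc and annotates it.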

A more general NLP pipeline may look like the diagram below.

<img height=300 width=400 src="https://github.com/jalalrahmanov/CodeBasics-nlp-tutorials-/blob/main/5_spacy_lang_processing_pipeline/spacy_loaded_pipeline.jpg?raw=1" />

<h3>Download trained pipeline</h3>

### These steps are not needed in Colab

To download a trained pipeline, use a command such as:

python -m spacy download en_core_web_sm

This downloads the small (sm) pipeline for the English language.

Further instructions: https://spacy.io/usage/models#quickstart

In [33]:
nlp = spacy.load("en_core_web_sm")
nlp.pipe_names

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']

In [34]:
nlp.pipeline

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7ecbe7c993c0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x7ecbe7c998a0>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7ecbe8f534c0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x7ecbe356ed40>),
 ('lemmatizer',
  <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7ecbe2ecc200>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7ecbe8f53220>)]
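Components listed in the pipeline can be switched off temporarily when their annotations are not needed. The sketch below is an illustration under an assumption: it uses a blank pipeline with spaCy's rule-based sentencizer component so that it runs without a downloaded model. The same select_pipes call should apply to the components of a trained pipeline such as en_core_web_sm.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # stand-in component, chosen because it needs no model
print(nlp.pipe_names)

text = "Captain america ate 100$ of samosa. Then he said I can do this all day."

# Inside the with-block the named components are skipped.
with nlp.select_pipes(disable=["sentencizer"]):
    names_inside = list(nlp.pipe_names)
    doc_without = nlp(text)

# After the block the components are restored.
names_after = list(nlp.pipe_names)
doc_with = nlp(text)

print(names_inside, doc_without.has_annotation("SENT_START"))
print(names_after, doc_with.has_annotation("SENT_START"))
```

Skipping components that a task does not need usually makes processing faster, since every component adds work per document.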

The sm in en_core_web_sm stands for small. Other sizes are available as well, such as medium (md) and large (lg). See https://spacy.io/usage/models#quickstart

In [35]:
doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)

Captain  |  proper noun  |  Captain
america  |  proper noun  |  america
ate  |  verb  |  eat
100  |  numeral  |  100
$  |  numeral  |  $
of  |  adposition  |  of
samosa  |  proper noun  |  samosa
.  |  punctuation  |  .
Then  |  adverb  |  then
he  |  pronoun  |  he
said  |  verb  |  say
I  |  pronoun  |  I
can  |  auxiliary  |  can
do  |  verb  |  do
this  |  pronoun  |  this
all  |  determiner  |  all
day  |  noun  |  day
.  |  punctuation  |  .


**Run the same code with a blank pipeline and check the output.**

In [36]:
# With no components in the pipeline, tokens are still printed, but POS tags and lemmas are empty
import spacy

nlp_blank = spacy.blank("en")

doc = nlp_blank("Captain america ate 100$ of samosa. Then he said I can do this all day.")

for token in doc:
    print(token, " | ", spacy.explain(token.pos_), " | ", token.lemma_)

Captain  |  None  |  
america  |  None  |  
ate  |  None  |  
100  |  None  |  
$  |  None  |  
of  |  None  |  
samosa  |  None  |  
.  |  None  |  
Then  |  None  |  
he  |  None  |  
said  |  None  |  
I  |  None  |  
can  |  None  |  
do  |  None  |  
this  |  None  |  
all  |  None  |  
day  |  None  |  
.  |  None  |  
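Instead of reading the None values off the printed table, the Doc can be asked directly whether an annotation was set. This is a small sketch that assumes spaCy v3, where Doc.has_annotation is available.

```python
import spacy

nlp_blank = spacy.blank("en")
doc = nlp_blank("Then he said I can do this all day.")

# has_annotation reports whether the attribute was set on the Doc's tokens.
has_pos = doc.has_annotation("POS")
has_lemma = doc.has_annotation("LEMMA")
print(has_pos, has_lemma)

# Unset string attributes are empty strings, so spacy.explain
# has nothing to look up for them.
first_pos = doc[0].pos_
first_lemma = doc[0].lemma_
print(repr(first_pos), repr(first_lemma))
```

A check like this is useful before code that depends on POS tags or lemmas, because a blank pipeline does not raise an error when they are missing.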


<h3>Named Entity Recognition (NER)</h3>

In [37]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY


In [43]:
# This returns the raw HTML string instead of displaying it; fixed in the next cell with jupyter=True
# displacy is only used to show the entities visually within the sentence
from spacy import displacy

displacy.render(doc, style="ent")

'<div class="entities" style="line-height: 2.5; direction: ltr">\n<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    Tesla Inc\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">ORG</span>\n</mark>\n is going to acquire twitter for \n<mark class="entity" style="background: #e4e7d2; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    $45 billion\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">MONEY</span>\n</mark>\n</div>'

In [44]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)

In [54]:
from spacy import displacy

displacy.render(doc, style="dep", jupyter=True, options={'distance':100})

In [57]:
from spacy import displacy

doc1 = nlp("This is a sentence.")
doc2 = nlp("This is another sentence.")
html = displacy.render([doc1, doc2], style="dep", page=True, jupyter=True)
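Outside a notebook, the HTML string returned by displacy.render can be written to a file and opened in a browser. The sketch below makes two assumptions for illustration: the entities are set by hand with doc.char_span so that no trained model is needed, and the file name entities.html in the temporary directory is arbitrary.

```python
import tempfile
from pathlib import Path

import spacy
from spacy import displacy

nlp = spacy.blank("en")
text = "Tesla Inc is going to acquire twitter for $45 billion"
doc = nlp(text)

# Assumption: entities are assigned manually here; a trained pipeline
# with an "ner" component would set doc.ents on its own.
money_start = text.index("$45 billion")
doc.ents = [
    doc.char_span(0, len("Tesla Inc"), label="ORG"),
    doc.char_span(money_start, len(text), label="MONEY"),
]

# jupyter=False returns the markup as a string; page=True wraps it in a full HTML page.
html = displacy.render(doc, style="ent", page=True, jupyter=False)

output_path = Path(tempfile.gettempdir()) / "entities.html"
output_path.write_text(html, encoding="utf-8")
print(output_path)
```

With jupyter=True the visualization is displayed inline instead, which is why the earlier cell without that flag only showed the raw string.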

<h3>Trained processing pipeline in French</h3>

In [68]:
# Raises an error because the French pipeline has to be downloaded first
nlp = spacy.load("fr_core_news_sm")

You need to install the processing pipeline for the French language using this command:

python -m spacy download fr_core_news_sm
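When a pipeline package is not installed, spacy.load raises an OSError. A script can catch it and fall back to a blank pipeline for the same language. This is only a sketch: load_or_blank is a hypothetical helper name, and falling back silently may not be what every project wants.

```python
import spacy


def load_or_blank(name, lang):
    """Hypothetical helper: load a trained pipeline, or fall back to a blank one."""
    try:
        return spacy.load(name)
    except OSError:
        # spacy.load raises OSError when the package cannot be found.
        print(f"{name} is not installed, using a blank '{lang}' pipeline instead")
        return spacy.blank(lang)


# Made-up package name that is not expected to exist, to trigger the fallback.
nlp_fallback = load_or_blank("pipeline_that_is_not_installed", "fr")
print(nlp_fallback.lang, nlp_fallback.pipe_names)
```

The blank fallback can only tokenize, so entities, POS tags, and lemmas stay empty until the real pipeline is downloaded.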


In [66]:
!pip install spacy
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.6.0/fr_core_news_sm-3.6.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
Successfully insta

In [67]:
nlp = spacy.load("fr_core_news_sm")

In [69]:
doc = nlp("Tesla Inc va racheter Twitter pour $45 milliards de dollars")
for ent in doc.ents:
    print(ent.text, " | ", ent.label_, " | ", spacy.explain(ent.label_))

Tesla Inc  |  ORG  |  Companies, agencies, institutions, etc.
Twitter  |  MISC  |  Miscellaneous entities, e.g. events, nationalities, products or works of art


In [70]:
for token in doc:
    print(token, " | ", token.pos_, " | ", token.lemma_)

Tesla  |  PROPN  |  Tesla
Inc  |  PROPN  |  Inc
va  |  VERB  |  aller
racheter  |  VERB  |  racheter
Twitter  |  VERB  |  twitter
pour  |  ADP  |  pour
$  |  NOUN  |  dollar
45  |  NUM  |  45
milliards  |  NOUN  |  milliard
de  |  ADP  |  de
dollars  |  NOUN  |  dollar


<h3>Adding a component to a blank pipeline</h3>

In [72]:
source_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("ner", source=source_nlp)
nlp.pipe_names

['ner']

In [73]:
doc = nlp("Tesla Inc is going to acquire twitter for $45 billion")
for ent in doc.ents:
    print(ent.text, ent.label_)

Tesla Inc ORG
$45 billion MONEY


The image below shows a sentencizer component in the pipeline.

<img height=300 width=400 src="https://github.com/jalalrahmanov/CodeBasics-nlp-tutorials-/blob/main/5_spacy_lang_processing_pipeline/sentecizer.jpg?raw=1" />
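The sentencizer can be tried out without a trained model, because it is a rule-based component that splits sentences on punctuation. A minimal sketch, assuming spaCy v3 and its built-in "sentencizer" factory name:

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # built-in factory, no source pipeline required
print(nlp.pipe_names)

doc = nlp("Captain america ate 100$ of samosa. Then he said I can do this all day.")

# doc.sents is available because the sentencizer set the sentence boundaries.
sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

Unlike the ner component above, the sentencizer has no trained weights, so it is added by name alone rather than copied from a source pipeline.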

<h3>Further reading</h3>

https://spacy.io/usage/processing-pipelines#pipelines