# Get started with spaCy

Before running the cells below, make sure you have downloaded the `en_core_web_sm` model.
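The model is normally installed with `python -m spacy download en_core_web_sm`, which places it in the environment as a regular Python package. As a minimal sketch, the check below uses only the standard library to test whether a package with that name is importable before `spacy.load` is called. The helper name `is_installed` is an assumption made for illustration.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if a top-level package with this name can be imported."""
    return find_spec(package_name) is not None


# Models installed via `spacy download` are ordinary Python packages,
# so the model name doubles as the package name.
if not is_installed("en_core_web_sm"):
    print("Model missing; run: python -m spacy download en_core_web_sm")
```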

In [34]:
import spacy
nlp = spacy.load("en_core_web_sm")
print(type(nlp))

<class 'spacy.lang.en.English'>


Import example sentences:

In [32]:
from spacy.lang.en.examples import sentences
for sentence in sentences[0:5]:
  print(sentence)

Apple is looking at buying U.K. startup for $1 billion
Autonomous cars shift insurance liability toward manufacturers
San Francisco considers banning sidewalk delivery robots
London is a big city in the United Kingdom.
Where are you?


Pick the first example sentence:

In [35]:
example_sentence = sentences[0]
print(example_sentence)

Apple is looking at buying U.K. startup for $1 billion


Create a `Doc` object by calling `nlp` on the sentence, then print its attributes:

In [36]:
doc = nlp(example_sentence)
print(type(doc))

print([attr for attr in dir(doc) if not attr.startswith("__")])

<class 'spacy.tokens.doc.Doc'>
['_', '_bulk_merge', '_py_tokens', '_realloc', '_vector', '_vector_norm', 'cats', 'char_span', 'count_by', 'doc', 'ents', 'extend_tensor', 'from_array', 'from_bytes', 'from_disk', 'get_extension', 'get_lca_matrix', 'has_extension', 'has_vector', 'is_nered', 'is_parsed', 'is_sentenced', 'is_tagged', 'lang', 'lang_', 'mem', 'merge', 'noun_chunks', 'noun_chunks_iterator', 'print_tree', 'remove_extension', 'retokenize', 'sentiment', 'sents', 'set_extension', 'similarity', 'tensor', 'text', 'text_with_ws', 'to_array', 'to_bytes', 'to_disk', 'to_json', 'user_data', 'user_hooks', 'user_span_hooks', 'user_token_hooks', 'vector', 'vector_norm', 'vocab']


Print token attributes created by spaCy:

In [37]:
for token in doc:
  print(f"Text: {token.text},", f"lemma: {token.lemma_},", f"POS tag: {token.pos_},", f"detailed POS tag: {token.tag_},", f"shape: {token.shape_},", f"is alphabetic: {token.is_alpha},", f"is stop word: {token.is_stop},", f"syntactic dependency: {token.dep_}.")

Text: Apple, lemma: Apple, POS tag: PROPN, detailed POS tag: NNP, shape: Xxxxx, is alphabetic: True, is stop word: False, syntactic dependency: nsubj.
Text: is, lemma: be, POS tag: VERB, detailed POS tag: VBZ, shape: xx, is alphabetic: True, is stop word: True, syntactic dependency: aux.
Text: looking, lemma: look, POS tag: VERB, detailed POS tag: VBG, shape: xxxx, is alphabetic: True, is stop word: False, syntactic dependency: ROOT.
Text: at, lemma: at, POS tag: ADP, detailed POS tag: IN, shape: xx, is alphabetic: True, is stop word: True, syntactic dependency: prep.
Text: buying, lemma: buy, POS tag: VERB, detailed POS tag: VBG, shape: xxxx, is alphabetic: True, is stop word: False, syntactic dependency: pcomp.
Text: U.K., lemma: U.K., POS tag: PROPN, detailed POS tag: NNP, shape: X.X., is alphabetic: False, is stop word: False, syntactic dependency: compound.
Text: startup, lemma: startup, POS tag: NOUN, 
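The `shape` values above (`Xxxxx`, `xxxx`, `X.X.`) follow a simple pattern: uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, and long runs of the same symbol are cut off after four. The sketch below is a simplified reimplementation that reproduces the shapes printed above. The function name `word_shape` and the `max_run` parameter are illustrative assumptions, and spaCy's own implementation may differ in details.

```python
def word_shape(text, max_run=4):
    """Map a token to a coarse orthographic shape such as 'Xxxxx' or 'X.X.'."""
    shape = []
    last_char = ""
    run_length = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Track the length of the current run of identical shape characters.
        run_length = run_length + 1 if shape_char == last_char else 1
        last_char = shape_char
        # Runs longer than max_run are truncated.
        if run_length <= max_run:
            shape.append(shape_char)
    return "".join(shape)


for word in ["Apple", "is", "looking", "U.K."]:
    print(word, word_shape(word))
```

Truncating long runs keeps the number of distinct shapes small, so words of different lengths such as `looking` and `buying` share one shape.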

The `nlp` object consists of a tokenizer and a processing pipeline.

In [25]:
nlp.tokenizer, nlp.pipeline

(<spacy.tokenizer.Tokenizer at 0x11766a750>,
 [('tagger', <spacy.pipeline.pipes.Tagger at 0x109eb2e80>),
  ('parser', <spacy.pipeline.pipes.DependencyParser at 0x117cc6228>),
  ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x117cc6288>)])
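To make the division of labour concrete, here is a toy model of that architecture in plain Python: a tokenizer turns text into a document, then each named component in an ordered list annotates it in turn. All names here (`ToyDoc`, `ToyLanguage`, `toy_tokenizer`, `toy_tagger`) and the tagging rule are made up for illustration and are not spaCy classes.

```python
class ToyDoc:
    """Hypothetical stand-in for a document: words plus per-word tags."""

    def __init__(self, words):
        self.words = words
        self.tags = [""] * len(words)  # empty until a tagger fills them in


def toy_tokenizer(text):
    # Whitespace splitting only; a real tokenizer applies many more rules.
    return ToyDoc(text.split())


def toy_tagger(doc):
    # Made-up rule: capitalised words get "PROPN", everything else "X".
    doc.tags = ["PROPN" if word[:1].isupper() else "X" for word in doc.words]
    return doc


class ToyLanguage:
    def __init__(self, tokenizer, pipeline):
        self.tokenizer = tokenizer
        self.pipeline = pipeline  # ordered list of (name, component) pairs

    def __call__(self, text):
        doc = self.tokenizer(text)  # step 1: text -> document
        for _name, component in self.pipeline:
            doc = component(doc)  # step 2: each component annotates it
        return doc


toy_nlp = ToyLanguage(toy_tokenizer, [("tagger", toy_tagger)])
print(toy_nlp("Apple buys startup").tags)  # full pipeline
print(toy_nlp.tokenizer("Apple buys startup").tags)  # tokenizer only
```

Calling the tokenizer on its own leaves the tags empty, because no later component has run.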

Sentence segmentation:

In [19]:
doc3 = nlp(u"This is the first sentence. This is another sentence. This is the last sentence.")

In [23]:
for sentence in doc3.sents:
  print(sentence)

This is the first sentence.
This is another sentence.
This is the last sentence.


Check whether a token starts a sentence (index 6 is the second `This`):

In [24]:
doc3[6].is_sent_start

True
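For intuition about what a sentence-start flag means, the sketch below computes one per token with a deliberately naive rule: a token starts a sentence if it is the first token or follows `.`, `!` or `?`. This is a swapped-in heuristic, not how spaCy decides boundaries, and the function name `sentence_start_flags` is an assumption.

```python
import re

SENTENCE_FINAL = {".", "!", "?"}


def sentence_start_flags(text):
    """Return (tokens, flags); flags[i] is True if token i opens a sentence."""
    # Crude tokenization: runs of word characters, or single punctuation marks.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    flags = [
        i == 0 or tokens[i - 1] in SENTENCE_FINAL
        for i in range(len(tokens))
    ]
    return tokens, flags


text = "This is the first sentence. This is another sentence. This is the last sentence."
tokens, flags = sentence_start_flags(text)
print(tokens[6], flags[6])
```

A rule like this breaks on abbreviations such as `U.K.`, where a period does not end the sentence, which is one reason to rely on the pipeline's own segmentation.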

### Tokenization examples

In [54]:
mystring = "We're moving to L.A.!"
doc = nlp(mystring)
for t in doc:
  print(t.text)

We
're
moving
to
L.A.
!


Check how tokenization is done:

In [43]:
for token in doc:
  print(token.text)

We
're
moving
to
L.A.
!


In [44]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@example.com or visit us at http://example.com")

In [48]:
for token in doc2:
  print(token, token.pos_, token.dep_)

We PRON nsubj
're VERB ROOT
here ADV advmod
to PART aux
help VERB advcl
! PUNCT punct
Send VERB ROOT
snail NOUN compound
- PUNCT punct
mail NOUN dobj
, PUNCT punct
email NOUN conj
support@example.com X dobj
or CCONJ cc
visit VERB conj
us PRON dobj
at ADP prep
http://example.com X pobj


In [49]:
doc3 = nlp(u"A 5km NYC cab ride costs $10.30")

In [51]:
for t in doc3:
  print(t.text)

A
5
km
NYC
cab
ride
costs
$
10.30


In [52]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")
for token in doc4:
  print(token)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.
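The outputs above are consistent with a rule-based scheme: split on whitespace, keep or expand known special cases (`We're`, `U.S.`, `St.`), and peel prefixes (`$`) and suffixes (`!`, `.`, `km`) off the remaining chunks. The sketch below is a heavily simplified illustration of that idea. The rule tables and the names `tokenize_chunk` and `toy_tokenize` are hand-picked assumptions covering only the examples in this section, not spaCy's actual rules.

```python
# Hypothetical, hand-picked rules for illustration only.
SPECIAL_CASES = {
    "We're": ["We", "'re"],
    "Let's": ["Let", "'s"],
    "L.A.": ["L.A."],
    "U.S.": ["U.S."],
    "St.": ["St."],
}
PREFIXES = ("$", "(", '"')
SUFFIXES = ("km", "!", "?", ",", ".", ")", '"')


def tokenize_chunk(chunk):
    """Split one whitespace-delimited chunk into tokens."""
    leading, trailing = [], []
    # Peel off one prefix or suffix at a time until a special case
    # matches or nothing more can be removed.
    while chunk and chunk not in SPECIAL_CASES:
        prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
        suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
        if prefix:
            leading.append(prefix)
            chunk = chunk[len(prefix):]
        elif suffix:
            trailing.insert(0, suffix)
            chunk = chunk[:-len(suffix)]
        else:
            break
    middle = SPECIAL_CASES.get(chunk, [chunk] if chunk else [])
    return leading + middle + trailing


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        tokens.extend(tokenize_chunk(chunk))
    return tokens


print(toy_tokenize("We're moving to L.A.!"))
print(toy_tokenize("A 5km NYC cab ride costs $10.30"))
```

Checking special cases before stripping punctuation is what keeps `U.S.` and `St.` intact while the final period of `year.` is still split off.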


### Vocabulary

In [57]:
len(doc4.vocab)

516

### Tokenizer
Use only the tokenizer. Note that `token.tag_` is empty after tokenization alone, because the tokenizer does not run the tagger or the parser.

In [86]:
tokenized = nlp.tokenizer(u"Let's visit St. Louis.")
for token in tokenized:
  print(f"{token.text}, tag: {token.tag_}")

Let, tag: 
's, tag: 
visit, tag: 
St., tag: 
Louis, tag: 
., tag: 


### Named entity recognition

In [87]:
doc = nlp(u"Apple to build a Hong Kong factory for $6 million.")
for token in doc:
  print(token.text, end="|")

Apple|to|build|a|Hong|Kong|factory|for|$|6|million|.|

Print named entities:

In [88]:
for entity in doc.ents:
  print(entity, f"{entity.label_} ({spacy.explain(entity.label_)})", sep=", ")

Apple, ORG (Companies, agencies, institutions, etc.)
Hong Kong, GPE (Countries, cities, states)
$6 million, MONEY (Monetary values, including unit)
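Multi-token entities such as `Hong Kong` and `$6 million` are spans over several tokens. In spaCy, each token exposes its position within an entity through `token.ent_iob_` (`B` for begin, `I` for inside, `O` for outside) and its label through `token.ent_type_`. The sketch below shows how such per-token tags can be grouped into entity spans. The tag values are written by hand to match the output above rather than produced by a model, and the function name `group_entities` is an assumption.

```python
# Hand-written (text_with_whitespace, iob, label) triples; assumed values
# chosen to match the entities printed above.
TAGGED = [
    ("Apple ", "B", "ORG"), ("to ", "O", ""), ("build ", "O", ""),
    ("a ", "O", ""), ("Hong ", "B", "GPE"), ("Kong ", "I", "GPE"),
    ("factory ", "O", ""), ("for ", "O", ""), ("$", "B", "MONEY"),
    ("6 ", "I", "MONEY"), ("million", "I", "MONEY"), (".", "O", ""),
]


def group_entities(tagged_tokens):
    """Merge per-token IOB tags into (entity_text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for text, iob, label in tagged_tokens:
        if iob == "B":  # a new entity begins here
            if current_words:
                entities.append(("".join(current_words).strip(), current_label))
            current_words, current_label = [text], label
        elif iob == "I" and current_words:  # continue the open entity
            current_words.append(text)
        else:  # outside any entity: close the open one, if any
            if current_words:
                entities.append(("".join(current_words).strip(), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append(("".join(current_words).strip(), current_label))
    return entities


for entity_text, label in group_entities(TAGGED):
    print(entity_text, label, sep=", ")
```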


### Noun chunks

In [89]:
for chunk in doc.noun_chunks:
  print(chunk)

Apple
a Hong Kong factory


## Visualization

In [95]:
from spacy import displacy

In [96]:
doc = nlp(u"Apple is going to build a U.K. factory for $6 million.")

In [100]:
displacy.render(doc, style="dep", jupyter=True, options={"distance": 110})

In [101]:
displacy.render(doc, style="ent", jupyter=True)