In [25]:
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [26]:
import spacy

Part 1:

In [27]:
nlp=spacy.load('en_core_web_sm')
my_statement="The quick brown fox doesn't jump over the lazy dog. Natural Language Processing is fascinating!"
my_statement_nlp=nlp(my_statement)
tokens_spacy=[token.text for token in my_statement_nlp]
print(tokens_spacy)

['The', 'quick', 'brown', 'fox', 'does', "n't", 'jump', 'over', 'the', 'lazy', 'dog', '.', 'Natural', 'Language', 'Processing', 'is', 'fascinating', '!']


In [28]:
for token in my_statement_nlp:
    print(f"{token.text} - {token.lemma_}: {token.head}:{token.morph}")

The - the: fox:Definite=Def|PronType=Art
quick - quick: fox:Degree=Pos
brown - brown: fox:Degree=Pos
fox - fox: jump:Number=Sing
does - do: jump:Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
n't - not: jump:Polarity=Neg
jump - jump: jump:VerbForm=Inf
over - over: jump:
the - the: dog:Definite=Def|PronType=Art
lazy - lazy: dog:Degree=Pos
dog - dog: over:Number=Sing
. - .: jump:PunctType=Peri
Natural - Natural: Language:Number=Sing
Language - Language: Processing:Number=Sing
Processing - processing: is:Number=Sing
is - be: is:Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin
fascinating - fascinating: is:Degree=Pos
! - !: is:PunctType=Peri


spaCy tokenizes text by splitting it into words, punctuation, and special characters. Each token is assigned attributes such as part of speech (POS), lemma (base form), and morphological features. This structured representation supports downstream tasks such as dependency parsing, named entity recognition (NER), and sentiment analysis.

spaCy treats punctuation as separate tokens, so periods, commas, exclamation marks, and other punctuation marks are not attached to words. In the output above, the period and the exclamation mark share the same morphological feature, PunctType=Peri.

Contractions are split into separate tokens: in the output above, "doesn't" becomes "does" and "n't", and "n't" is lemmatized to "not". Splitting contractions lets spaCy lemmatize, tag, and parse each part on its own, which makes it easier to analyze the words and the meaning of the sentence.
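The tokenization behavior described above can be checked without a trained model. The sketch below is a minimal illustration, assuming a blank English pipeline created with spacy.blank("en"), which contains only the tokenizer. The name tokenizer_only and the shortened example sentence are illustrative and not part of the original notebook.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only, no trained components).
# A separate name is used so the loaded `nlp` model is not overwritten.
tokenizer_only = spacy.blank("en")

doc = tokenizer_only("The fox doesn't jump.")

# The contraction is split into two tokens and the period stands alone.
tokens = [token.text for token in doc]
print(tokens)

# is_punct is a lexical flag, so it is available without a trained model.
punct_tokens = [token.text for token in doc if token.is_punct]
print(punct_tokens)
```

A blank pipeline only shows the splitting step. Lemmas, POS tags, and morphological features such as those printed above still need the loaded en_core_web_sm pipeline.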

Part 2:

In [29]:
for token in my_statement_nlp:
    print(f"{token.text} - {token.pos_}: {token.tag_}")

The - DET: DT
quick - ADJ: JJ
brown - ADJ: JJ
fox - NOUN: NN
does - AUX: VBZ
n't - PART: RB
jump - VERB: VB
over - ADP: IN
the - DET: DT
lazy - ADJ: JJ
dog - NOUN: NN
. - PUNCT: .
Natural - PROPN: NNP
Language - PROPN: NNP
Processing - NOUN: NN
is - AUX: VBZ
fascinating - ADJ: JJ
! - PUNCT: .


"quick" -> ADJ (Adjective)

"jump" -> VERB (Verb)

"is" -> AUX (Auxiliary Verb)

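POS tagging provides structural and grammatical insight into a sentence. It is useful for error correction, identifying meaning, and accurate translation in NLP applications. In the output above, the first label (token.pos_) is the coarse-grained part of speech and the second (token.tag_) is the fine-grained tag.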
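The abbreviations in the Part 2 output can be looked up with spacy.explain, the same helper used for entity labels in Part 3. The sketch below is a small illustration: the two lists are copied by hand from the output above, and the variable names are assumptions made for this example.

```python
import spacy

# Labels copied from the Part 2 output: coarse-grained (pos_) and fine-grained (tag_).
coarse_tags = ["DET", "ADJ", "NOUN", "AUX", "PART", "VERB", "ADP", "PUNCT", "PROPN"]
fine_tags = ["DT", "JJ", "NN", "VBZ", "RB", "VB", "IN", ".", "NNP"]

# spacy.explain returns a short description, or None for a label it does not know.
descriptions = {label: spacy.explain(label) for label in coarse_tags + fine_tags}

for label, description in descriptions.items():
    print(f"{label:>5}: {description}")
```

This only reads spaCy's built-in glossary, so it runs without loading a model.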

Part 3:

In [30]:
my_new_statement="Barack Obama was the 44th President of the United States. He was born in Hawaii."
my_new_statement_nlp=nlp(my_new_statement)
for ent in my_new_statement_nlp.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

Barack Obama: PERSON (People, including fictional)
44th: ORDINAL ("first", "second", etc.)
the United States: GPE (Countries, cities, states)
Hawaii: GPE (Countries, cities, states)


spaCy recognizes many entity types, such as people, geopolitical entities (countries, cities, states), organizations, ordinal numbers, and cardinal numbers.

"Barack Obama"-> PERSON

"Hawaii" -> GPE (Geopolitical Entity)

Part 4:

In [None]:
chosen_statement="The Arctic Monkeys released the AM album in 2013."
chosen_statement_nlp=nlp(chosen_statement)

In [32]:
for ent in chosen_statement_nlp.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

The Arctic Monkeys: ORG (Companies, agencies, institutions, etc.)
2013: DATE (Absolute or relative dates or periods)


In [33]:
for token in chosen_statement_nlp:
    print(f"{token.text} - {token.pos_}: {token.tag_}")

The - DET: DT
Arctic - PROPN: NNP
Monkeys - PROPN: NNP
released - VERB: VBD
the - DET: DT
AM - PROPN: NNP
album - NOUN: NN
in - ADP: IN
2013 - NUM: CD
. - PUNCT: .


In [40]:
typo_statement="The arctic monkeys relleased the am album in2013."
typo_statement_nlp=nlp(typo_statement)

In [41]:
for ent in typo_statement_nlp.ents:
    print(f"{ent.text}: {ent.label_} ({spacy.explain(ent.label_)})")

arctic: LOC (Non-GPE locations, mountain ranges, bodies of water)


In [42]:
for token in typo_statement_nlp:
    print(f"{token.text} - {token.pos_}: {token.tag_}")

The - DET: DT
arctic - ADJ: JJ
monkeys - NOUN: NNS
relleased - VERB: VBD
the - DET: DT
am - PROPN: NNP
album - PROPN: NNP
in2013 - PROPN: NNP
. - PUNCT: .


I created a new sentence and ran it once with no typos and proper capitalization. As expected, everything was identified correctly: "Arctic", "Monkeys", and "AM" were tagged as proper nouns, and "2013" was tagged as a number.

When I made the proper nouns lowercase, "am" was still labeled a proper noun, but "arctic" and "monkeys" were labeled as an adjective and a noun, respectively, most likely because both are ordinary English words. Misspelling "released" did not cause any problems: "relleased" was still tagged as a past-tense verb (VBD). Leaving out the space between "in" and "2013", however, made spaCy treat "in2013" as a single token and label it a proper noun. I think the misspelling of "released" was close enough to the real spelling that spaCy did not have an issue, whereas "in2013" was not recognized at all. The typos also affected the rest of the output: "album" was relabeled as a proper noun, and the only entity found was "arctic" (LOC), so "The Arctic Monkeys" (ORG) and "2013" (DATE) were both lost.
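The "in2013" result can be traced back to tokenization. The sketch below is a minimal check, assuming a blank English pipeline from spacy.blank("en") (tokenizer only); the variable names are illustrative and not part of the original notebook.

```python
import spacy

# Assumption: tokenizer-only pipeline, so no model download is needed.
tokenizer_only = spacy.blank("en")

clean = "The Arctic Monkeys released the AM album in 2013."
typo = "The arctic monkeys relleased the am album in2013."

clean_tokens = [token.text for token in tokenizer_only(clean)]
typo_tokens = [token.text for token in tokenizer_only(typo)]

print(clean_tokens)
print(typo_tokens)

# The typo version has one token fewer because "in2013" is never split.
print(len(clean_tokens), len(typo_tokens))
```

Because the tokenizer keeps "in2013" as one token, the tagger and the entity recognizer never see "2013" on its own, which is consistent with the missing DATE entity in the typo run.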