# Introduction to NLTK and spaCy


## How to read a file from disk



In [1]:
from pathlib import Path

In [2]:
cur_dir = Path().resolve()  # the current working directory, normally the folder containing this notebook
path_to_file = cur_dir / 'Lab1-apple-samsung-example.txt'  # file name as shown in the printed path below
print(path_to_file)
print('does path exist? ->', path_to_file.exists())

/home/mihaly/Documents/uni/VU/text_mining/ba-text-mining/lab_sessions/lab1/Lab1-apple-samsung-example.txt
does path exist? -> True


In [3]:
with open(path_to_file, encoding='utf-8') as infile:
    text = infile.read()

print('number of characters', len(text))

number of characters 1139
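
As a side note, `pathlib` can also read a file without an explicit `open` call. The following is a minimal, self-contained sketch of that variant; the temporary directory, the file name `sample.txt` and its contents are assumptions made up for illustration and are not part of the lab data.

```python
from pathlib import Path
import tempfile

# Work in a throwaway directory so the sketch does not depend on the lab files.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / 'sample.txt'  # hypothetical file name
    sample_path.write_text('Samsung paid £0.66bn.', encoding='utf-8')

    # exists() and read_text() are methods on the Path object itself.
    sample_exists = sample_path.exists()
    sample_text = sample_path.read_text(encoding='utf-8')

print('does path exist? ->', sample_exists)
print('number of characters', len(sample_text))
```

Passing `encoding='utf-8'` explicitly keeps characters such as `£` readable regardless of the platform default.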


## Part 1: NLTK

In this part we use NLTK to apply part-of-speech (POS) tagging, named entity recognition (NER), and constituency parsing. The following code snippet already performs sentence splitting and tokenization.

In [4]:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize

In [5]:
sentences_nltk = sent_tokenize(text)

In [6]:
tokens_per_sentence = []
for sentence_nltk in sentences_nltk:
    sent_tokens = word_tokenize(sentence_nltk)
    tokens_per_sentence.append(sent_tokens)

In [7]:
sent_id = 1
print('SENTENCE', sentences_nltk[sent_id])
print('TOKENS', tokens_per_sentence[sent_id])

SENTENCE The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.
TOKENS ['The', 'six', 'phones', 'and', 'tablets', 'affected', 'are', 'the', 'Galaxy', 'S', 'III', ',', 'running', 'the', 'new', 'Jelly', 'Bean', 'system', ',', 'the', 'Galaxy', 'Tab', '8.9', 'Wifi', 'tablet', ',', 'the', 'Galaxy', 'Tab', '2', '10.1', ',', 'Galaxy', 'Rugby', 'Pro', 'and', 'Galaxy', 'S', 'III', 'mini', '.']


### Part 1a: Part-of-speech (POS) tagging

In [8]:
pos_tags_per_sentence = []
for tokens in tokens_per_sentence:
    pos_tagged = nltk.pos_tag(tokens)
    pos_tags_per_sentence.append(pos_tagged)
    #print(pos_tagged)

In [9]:
print(pos_tags_per_sentence)

[[('https', 'NN'), (':', ':'), ('//www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html', 'JJ'), ('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'), ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'), ('California', 'NNP'), ('on', 'IN'), ('November', 'NNP'), ('23', 'CD'), ('list', 'NN'), ('six', 'CD'), ('Samsung', 'NNP'), ('products', 'NNS'), ('running', 'VBG'), ('the', 'DT'), ('``', '``'), ('Jelly', 'RB'), ('Bean', 'NNP'), ("''", "''"), ('and', 'CC'), ('``', '``'), ('Ice', 'NNP'), ('Cream', 'NNP'), ('Sandwich', 'NNP'), ("''", "''"), ('operating', 'VBG'), ('systems', 'NNS'), (',', ','), ('which', 'WDT'), ('Apple', 'NNP'), ('claims', 'VBZ'), ('infringe', 'VB'), ('its', 'PRP$'), ('patents', 'NNS'), ('.', '.')], [('The', 'DT'), ('six', 'CD'), ('phones', 'NNS'), ('and', 'CC'), ('tablets', 'NNS'), ('affected', 'VBN'), ('are', 'VBP'), ('the', 'DT'), ('Galaxy', 'NNP'), ('S', 'NNP'), ('III', 'NN
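
The nested list printed above is hard to read at a glance. One possible way to summarise it is to count how often each tag occurs. The sketch below does this on a few `(token, tag)` pairs copied from the output above; `sample_tagged` is a hand-made stand-in for `pos_tags_per_sentence`, so that the snippet runs without NLTK or its models.

```python
from collections import Counter

# A few (token, tag) pairs copied from the printed output above.
sample_tagged = [
    [('Documents', 'NNS'), ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'),
     ('San', 'NNP'), ('Jose', 'NNP'), ('federal', 'JJ'), ('court', 'NN')],
    [('The', 'DT'), ('six', 'CD'), ('phones', 'NNS')],
]

# Flatten the sentences and count every tag.
tag_counts = Counter(tag for sentence in sample_tagged for _token, tag in sentence)

for tag, count in tag_counts.most_common():
    print(tag, count)
```

In the notebook itself, `sample_tagged` could be replaced by `pos_tags_per_sentence` to get the counts for the whole text.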

### Part 1b: Named Entity Recognition (NER)


In [10]:
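# Demo of the NER pipeline on a short excerpt.
# Note: this rebinds `text`, which later cells (e.g. the spaCy part) reuse.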
from nltk.chunk import ne_chunk

text = '''In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices. Samsung, which is the world's top mobile phone maker, is appealing the ruling. A similar case in the UK found in Samsung's favour and ordered Apple to publish an apology making clear that the South Korean firm had not copied its iPad when designing its own devices.'''
sentences = nltk.sent_tokenize(text)
for sentence in sentences:
    
    tokens = nltk.word_tokenize(sentence)
    tokens_pos_tagged = nltk.pos_tag(tokens)
    tokens_pos_tagged_and_named_entities = ne_chunk(tokens_pos_tagged)
    print()
    print('ORIGINAL SENTENCE', sentence)
    print('NAMED ENTITY RECOGNITION OUTPUT', tokens_pos_tagged_and_named_entities)


ORIGINAL SENTENCE In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices.
NAMED ENTITY RECOGNITION OUTPUT (S
  In/IN
  (GPE August/NNP)
  ,/,
  (PERSON Samsung/NNP)
  lost/VBD
  a/DT
  (GSP US/NNP)
  patent/NN
  case/NN
  to/TO
  (GPE Apple/NNP)
  and/CC
  was/VBD
  ordered/VBN
  to/TO
  pay/VB
  its/PRP$
  rival/JJ
  $/$
  1.05bn/CD
  (/(
  £0.66bn/NN
  )/)
  in/IN
  damages/NNS
  for/IN
  copying/VBG
  features/NNS
  of/IN
  the/DT
  (ORGANIZATION iPad/NN)
  and/CC
  (ORGANIZATION iPhone/NN)
  in/IN
  its/PRP$
  (GPE Galaxy/NNP)
  range/NN
  of/IN
  devices/NNS
  ./.)

ORIGINAL SENTENCE Samsung, which is the world's top mobile phone maker, is appealing the ruling.
NAMED ENTITY RECOGNITION OUTPUT (S
  (GPE Samsung/NNP)
  ,/,
  which/WDT
  is/VBZ
  the/DT
  world/NN
  's/POS
  top/JJ
  mobile/NN
  phone/NN
  maker/NN
  ,/,
  is/VBZ
  appealing/VBG
  the/D

In [11]:
from nltk.chunk import ne_chunk

ner_tags_per_sentence = []
for pos_tagged in pos_tags_per_sentence:
    ner_tagged = ne_chunk(pos_tagged)  # tree with named-entity subtrees
    ner_tags_per_sentence.append(ner_tagged)  # keep every sentence, not only the last one
    print(ner_tagged)

(S
  https/NN
  :/:
  //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ
  Documents/NNS
  filed/VBN
  to/TO
  the/DT
  (ORGANIZATION San/NNP Jose/NNP)
  federal/JJ
  court/NN
  in/IN
  (GPE California/NNP)
  on/IN
  November/NNP
  23/CD
  list/NN
  six/CD
  (ORGANIZATION Samsung/NNP)
  products/NNS
  running/VBG
  the/DT
  ``/``
  Jelly/RB
  (GPE Bean/NNP)
  ''/''
  and/CC
  ``/``
  Ice/NNP
  Cream/NNP
  Sandwich/NNP
  ''/''
  operating/VBG
  systems/NNS
  ,/,
  which/WDT
  (PERSON Apple/NNP)
  claims/VBZ
  infringe/VB
  its/PRP$
  patents/NNS
  ./.)
(S
  The/DT
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  affected/VBN
  are/VBP
  the/DT
  (ORGANIZATION Galaxy/NNP)
  S/NNP
  III/NNP
  ,/,
  running/VBG
  the/DT
  new/JJ
  (PERSON Jelly/NNP Bean/NNP)
  system/NN
  ,/,
  the/DT
  (ORGANIZATION Galaxy/NNP)
  Tab/NNP
  8.9/CD
  Wifi/NNP
  tablet/NN
  ,/,
  the/DT
  (ORGANIZATION Galaxy/NNP)
  Tab/NNP
  2/CD
  10.1/CD
  ,/,
  (PE
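
The trees printed above mix entity chunks with ordinary tokens. A possible way to pull out only the entities is to walk over the children of each tree and keep the subtrees. The sketch below assumes the `Tree` class from NLTK and uses a hand-built tree (copied from part of the first sentence above) instead of a real `ne_chunk` result, so it needs no model downloads; `extract_entities` is a helper name introduced here for illustration.

```python
from nltk import Tree

# Hand-built stand-in for one ne_chunk result, copied from the output above.
chunked = Tree('S', [
    ('filed', 'VBN'), ('to', 'TO'), ('the', 'DT'),
    Tree('ORGANIZATION', [('San', 'NNP'), ('Jose', 'NNP')]),
    ('federal', 'JJ'), ('court', 'NN'), ('in', 'IN'),
    Tree('GPE', [('California', 'NNP')]),
])


def extract_entities(tree):
    """Return (entity text, label) pairs for the entity chunks in a sentence tree."""
    entities = []
    for node in tree:
        # Plain tokens are (token, tag) tuples; entity chunks are Tree objects.
        if isinstance(node, Tree):
            entity_text = ' '.join(token for token, _tag in node.leaves())
            entities.append((entity_text, node.label()))
    return entities


entities = extract_entities(chunked)
print(entities)
```

Applied to each tree in `ner_tags_per_sentence`, the same helper would give a flat list of entities that is easier to compare with the spaCy output later on.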

### Part 1c: Constituency parsing


In [12]:
constituent_parser = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*''')

In [13]:
constituency_output_per_sentence = []

for sentence in pos_tags_per_sentence:
    parsed_sentence = constituent_parser.parse(sentence)
    constituency_output_per_sentence.append(parsed_sentence)  # keep every parse, not only the last one
    print(parsed_sentence)

(S
  (NP https/NN)
  :/:
  (NP
    //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ)
  Documents/NNS
  (VP (V filed/VBN))
  to/TO
  (NP the/DT)
  San/NNP
  Jose/NNP
  (NP federal/JJ court/NN)
  (P in/IN)
  California/NNP
  (P on/IN)
  November/NNP
  23/CD
  (NP list/NN)
  six/CD
  Samsung/NNP
  products/NNS
  (VP (V running/VBG) (NP the/DT))
  ``/``
  Jelly/RB
  Bean/NNP
  ''/''
  and/CC
  ``/``
  Ice/NNP
  Cream/NNP
  Sandwich/NNP
  ''/''
  (VP (V operating/VBG))
  systems/NNS
  ,/,
  which/WDT
  Apple/NNP
  (VP (V claims/VBZ))
  (VP (V infringe/VB))
  its/PRP$
  patents/NNS
  ./.)
(S
  (NP The/DT)
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  (VP (V affected/VBN))
  (VP (V are/VBP) (NP the/DT))
  Galaxy/NNP
  S/NNP
  III/NNP
  ,/,
  (VP (V running/VBG) (NP the/DT new/JJ))
  Jelly/NNP
  Bean/NNP
  (NP system/NN)
  ,/,
  (NP the/DT)
  Galaxy/NNP
  Tab/NNP
  8.9/CD
  Wifi/NNP
  (NP tablet/NN)
  ,/,
  (NP the/DT)
  Galaxy/NNP


Augment the RegexpParser so that it also detects Named Entity Phrases (NEP), e.g., *Galaxy S III* and *Ice Cream Sandwich*.

In [14]:
constituent_parser_v2 = nltk.RegexpParser('''
NP: {<DT>? <JJ>* <NN>*} # NP
P: {<IN>}           # Preposition
V: {<V.*>}          # Verb
PP: {<P> <NP>}      # PP -> P NP
VP: {<V> <NP|PP>*}  # VP -> V (NP|PP)*
NEP: {<NNP><NNP>*<CD>*<NNP><CD>*<CD>*}   # NEP -> NNP NNP* CD* NNP CD* (at least two proper nouns, optionally with numbers)''')

In [15]:
constituency_v2_output_per_sentence = []

for sentence in pos_tags_per_sentence:
    parsed_sentence = constituent_parser_v2.parse(sentence)
    constituency_v2_output_per_sentence.append(parsed_sentence)  # keep every parse, not only the last one
    print(parsed_sentence)

(S
  (NP https/NN)
  :/:
  (NP
    //www.telegraph.co.uk/technology/apple/9702716/Apple-Samsung-lawsuit-six-more-products-under-scrutiny.html/JJ)
  Documents/NNS
  (VP (V filed/VBN))
  to/TO
  (NP the/DT)
  (NEP San/NNP Jose/NNP)
  (NP federal/JJ court/NN)
  (P in/IN)
  California/NNP
  (P on/IN)
  November/NNP
  23/CD
  (NP list/NN)
  six/CD
  Samsung/NNP
  products/NNS
  (VP (V running/VBG) (NP the/DT))
  ``/``
  Jelly/RB
  Bean/NNP
  ''/''
  and/CC
  ``/``
  (NEP Ice/NNP Cream/NNP Sandwich/NNP)
  ''/''
  (VP (V operating/VBG))
  systems/NNS
  ,/,
  which/WDT
  Apple/NNP
  (VP (V claims/VBZ))
  (VP (V infringe/VB))
  its/PRP$
  patents/NNS
  ./.)
(S
  (NP The/DT)
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  (VP (V affected/VBN))
  (VP (V are/VBP) (NP the/DT))
  (NEP Galaxy/NNP S/NNP III/NNP)
  ,/,
  (VP (V running/VBG) (NP the/DT new/JJ))
  (NEP Jelly/NNP Bean/NNP)
  (NP system/NN)
  ,/,
  (NP the/DT)
  (NEP Galaxy/NNP Tab/NNP 8.9/CD Wifi/NNP)
  (NP tablet/NN)
  ,/,
  (NP the/DT)
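
To see what the NEP rule does in isolation, it can help to run it on its own over a handful of hand-tagged tokens. The sketch below uses only the NEP pattern from the grammar above; the token list is put together by hand from tags that appear in the outputs, so no tagger or model download is needed.

```python
import nltk

# Only the NEP rule from the grammar above.
nep_parser = nltk.RegexpParser('NEP: {<NNP><NNP>*<CD>*<NNP><CD>*<CD>*}')

# Hand-tagged tokens, using tags that appear in the outputs above.
tagged = [('the', 'DT'), ('Galaxy', 'NNP'), ('Tab', 'NNP'), ('2', 'CD'),
          ('10.1', 'CD'), (',', ','), ('six', 'CD'), ('Samsung', 'NNP'),
          ('products', 'NNS')]

tree = nep_parser.parse(tagged)

# Collect the text of every NEP chunk.
nep_chunks = [' '.join(token for token, _tag in subtree.leaves())
              for subtree in tree.subtrees(lambda t: t.label() == 'NEP')]
print(nep_chunks)
```

Because the pattern requires at least two NNP tokens, a single proper noun such as *Samsung* is not chunked, which matches the output above where *Samsung*, *California* and *Apple* stay outside any NEP.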
 

## Part 2: spaCy


In [16]:
import spacy
nlp = spacy.load('en_core_web_sm')

In [17]:
doc = nlp(text)  # `text` is the short excerpt defined in the NLTK NER cell above

In [18]:
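# POS tagging
spacy_pos_tagged = []
for token in doc:
    # pos_ is the coarse universal tag, tag_ the fine-grained Penn Treebank style tag
    spacy_pos_tagged.append((token.text, token.pos_, token.tag_))
    print(token.text, token.pos_, token.tag_)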

In ADP IN
August PROPN NNP
, PUNCT ,
Samsung PROPN NNP
lost VERB VBD
a DET DT
US PROPN NNP
patent NOUN NN
case NOUN NN
to ADP IN
Apple PROPN NNP
and CCONJ CC
was AUX VBD
ordered VERB VBN
to PART TO
pay VERB VB
its PRON PRP$
rival NOUN NN
$ SYM $
1.05bn NUM CD
( PUNCT -LRB-
£ SYM $
0.66bn NOUN NN
) PUNCT -RRB-
in ADP IN
damages NOUN NNS
for ADP IN
copying VERB VBG
features NOUN NNS
of ADP IN
the DET DT
iPad PROPN NNP
and CCONJ CC
iPhone PROPN NNP
in ADP IN
its PRON PRP$
Galaxy PROPN NNP
range NOUN NN
of ADP IN
devices NOUN NNS
. PUNCT .
Samsung PROPN NNP
, PUNCT ,
which PRON WDT
is AUX VBZ
the DET DT
world NOUN NN
's PART POS
top ADJ JJ
mobile ADJ JJ
phone NOUN NN
maker NOUN NN
, PUNCT ,
is AUX VBZ
appealing VERB VBG
the DET DT
ruling NOUN NN
. PUNCT .
A DET DT
similar ADJ JJ
case NOUN NN
in ADP IN
the DET DT
UK PROPN NNP
found VERB VBN
in ADP IN
Samsung PROPN NNP
's PART POS
favour NOUN NN
and CCONJ CC
ordered VERB VBD
Apple PROPN NNP
to PART TO
publish VERB VB
an DET DT
apology NOUN N

In [19]:
# Named Entity Recognition (NER)
from spacy import displacy

displacy.render(doc, jupyter=True, style='ent')

for ent in doc.ents:
    print(ent.text, ent.label_)

August DATE
Samsung ORG
US GPE
Apple ORG
1.05bn MONEY
0.66bn MONEY
iPad ORG
Galaxy FAC
Samsung ORG
UK GPE
Samsung ORG
Apple ORG
South Korean NORP
iPad ORG


## Part 3: Comparison of NLTK and spaCy


### Comparison of Part of speech tagging


In [20]:
#sentence splitting in spaCy:
for index, sentence in enumerate(doc.sents, 1):
    print(f'SENTENCE: {index} {sentence}')

SENTENCE: 1 In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices.
SENTENCE: 2 Samsung, which is the world's top mobile phone maker, is appealing the ruling.
SENTENCE: 3 A similar case in the UK found in Samsung's favour and ordered Apple to publish an apology making clear that the South Korean firm had not copied its iPad when designing its own devices.


In [21]:
# sentence splitting in NLTK
nltk_sentence_splitted = sent_tokenize(text)
for index, sentence in enumerate(nltk_sentence_splitted, 1):
    print(f'SENTENCE: {index} {sentence}')

SENTENCE: 1 In August, Samsung lost a US patent case to Apple and was ordered to pay its rival $1.05bn (£0.66bn) in damages for copying features of the iPad and iPhone in its Galaxy range of devices.
SENTENCE: 2 Samsung, which is the world's top mobile phone maker, is appealing the ruling.
SENTENCE: 3 A similar case in the UK found in Samsung's favour and ordered Apple to publish an apology making clear that the South Korean firm had not copied its iPad when designing its own devices.


As we can see, the sentence splitting is fairly similar. However, if we look closely, spaCy made a small mistake in sentence 4 (not part of the three-sentence excerpt printed above), where it included a quotation mark from the previous sentence.

NLTK tags the URL as NN (noun) for "https" and JJ (adjective) for the rest, while spaCy treats the entire URL as a NOUN. In the first sentence, NLTK tags "Jelly" as RB (adverb), which is incorrect in this context, and "Bean" as NNP. In contrast, spaCy correctly identifies both "Jelly" and "Bean" as PROPN. NLTK tags "operating" as VBG, while spaCy tags it as NOUN, which might not fully capture its role in this sentence as part of the phrase "operating systems". Both libraries tag numerical values, conjunctions, and determiners similarly.

Generally, spaCy seems to provide a more context-aware analysis in some cases, particularly with proper nouns and entity names, probably due to its more sophisticated modeling. NLTK offers straightforward tagging based on simpler models, which can sometimes lead to misclassifications in complex or ambiguous contexts.
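
A comparison like this can also be made mechanically by lining up the fine-grained tags of both libraries for the same tokens. The sketch below does that for a few tokens from the first sentence of the short excerpt, with the tags copied by hand from the NLTK and spaCy outputs printed earlier; the list `tag_pairs` is therefore a hand-made sample, not something either library returns in this form.

```python
# (token, NLTK tag, spaCy tag_) for a few tokens of the excerpt's first sentence,
# copied from the outputs printed earlier in this notebook.
tag_pairs = [
    ('lost', 'VBD', 'VBD'),
    ('to', 'TO', 'IN'),       # the "to" in "to Apple"
    ('Apple', 'NNP', 'NNP'),
    ('rival', 'JJ', 'NN'),
    ('iPad', 'NN', 'NNP'),
    ('iPhone', 'NN', 'NNP'),
    ('devices', 'NNS', 'NNS'),
]

# Keep only the tokens on which the two taggers disagree.
disagreements = [(token, nltk_tag, spacy_tag)
                 for token, nltk_tag, spacy_tag in tag_pairs
                 if nltk_tag != spacy_tag]

for token, nltk_tag, spacy_tag in disagreements:
    print(f'{token}: NLTK={nltk_tag} spaCy={spacy_tag}')
print(f'{len(disagreements)} of {len(tag_pairs)} sampled tokens differ')
```

For a full comparison the tokens of both libraries would first have to be aligned, since they tokenize differently (for example, NLTK keeps `£0.66bn` as one token while spaCy splits it into `£` and `0.66bn`).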


### Comparison of Named Entity Recognition (NER)


spaCy tends to identify a broader range of entity types with more specific labels (e.g., DATE, CARDINAL, MONEY, LAW, FAC for facilities, NORP for nationalities and religious or political groups). NLTK, through its `ne_chunk` function, primarily focuses on traditional categories such as PERSON, ORGANIZATION, and GPE. In the output above, NLTK labels "California" as GPE, similar to spaCy, but labels "San Jose" as ORGANIZATION, and it does not distinguish between dates, numbers, or monetary values with specific labels. spaCy recognized "the 'Jelly Bean'" as LAW, which is a misclassification; however, it correctly identifies a wide range of entities, including "Galaxy Rugby Pro" as ORG.

spaCy appears to perform better for this specific task, given its ability to recognize a wider variety of entity types with more granularity and accuracy, despite occasional misclassifications. Its output is more informative for detailed text analysis, providing insights into not just who and where, but also when and how much, among other details. However, the choice between NLTK and spaCy should be based on the specific requirements of the given project.
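
The differences in entity labels can be made explicit by comparing the two sets of entities side by side. The sketch below does this for the first sentence of the short excerpt, with the entities and labels copied by hand from the NLTK and spaCy outputs printed earlier; the two dictionaries are hand-made stand-ins for the library results.

```python
# Entities of the excerpt's first sentence, copied from the outputs above.
nltk_entities = {
    'August': 'GPE', 'Samsung': 'PERSON', 'US': 'GSP', 'Apple': 'GPE',
    'iPad': 'ORGANIZATION', 'iPhone': 'ORGANIZATION', 'Galaxy': 'GPE',
}
spacy_entities = {
    'August': 'DATE', 'Samsung': 'ORG', 'US': 'GPE', 'Apple': 'ORG',
    '1.05bn': 'MONEY', '0.66bn': 'MONEY', 'iPad': 'ORG', 'Galaxy': 'FAC',
}

# Entities found by both libraries, and by only one of them.
shared = sorted(set(nltk_entities) & set(spacy_entities))
only_nltk = sorted(set(nltk_entities) - set(spacy_entities))
only_spacy = sorted(set(spacy_entities) - set(nltk_entities))

for entity in shared:
    print(f'{entity}: NLTK={nltk_entities[entity]} spaCy={spacy_entities[entity]}')
print('only NLTK:', only_nltk)
print('only spaCy:', only_spacy)
```

On this sample, the monetary amounts are found only by spaCy and "iPhone" only by NLTK, while the label names differ for most of the shared entities because the two libraries use different label sets.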

### Comparison of Constituency/dependency parsing
Choose one sentence from the text and run constituency parsing using NLTK and dependency parsing using spaCy.
* describe briefly the difference between constituency parsing and dependency parsing
* describe differences between the output from NLTK and spaCy.

In [25]:
#NLTK constituency parsing

# Parse only the second sentence; a separate name avoids overwriting the per-sentence results above
nltk_constituency_tree = constituent_parser_v2.parse(pos_tags_per_sentence[1])
print(nltk_constituency_tree)

(S
  (NP The/DT)
  six/CD
  phones/NNS
  and/CC
  tablets/NNS
  (VP (V affected/VBN))
  (VP (V are/VBP) (NP the/DT))
  (NEP Galaxy/NNP S/NNP III/NNP)
  ,/,
  (VP (V running/VBG) (NP the/DT new/JJ))
  (NEP Jelly/NNP Bean/NNP)
  (NP system/NN)
  ,/,
  (NP the/DT)
  (NEP Galaxy/NNP Tab/NNP 8.9/CD Wifi/NNP)
  (NP tablet/NN)
  ,/,
  (NP the/DT)
  (NEP Galaxy/NNP Tab/NNP 2/CD 10.1/CD)
  ,/,
  (NEP Galaxy/NNP Rugby/NNP Pro/NNP)
  and/CC
  (NEP Galaxy/NNP S/NNP III/NNP)
  (NP mini/NN)
  ./.)


In [27]:
#spaCy dependency parsing
doc2 = nlp("The six phones and tablets affected are the Galaxy S III, running the new Jelly Bean system, the Galaxy Tab 8.9 Wifi tablet, the Galaxy Tab 2 10.1, Galaxy Rugby Pro and Galaxy S III mini.")
displacy.render(doc2, jupyter=True, style='dep')
spacy.explain('CCONJ')

'coordinating conjunction'

Constituency parsing tries to identify the constituent (sub-phrase) structure of the sentence (e.g. verb phrases), while dependency parsing focuses on the relationships between individual words in a sentence (e.g. which word is the subject and how the parts of the sentence depend on each other).

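As expected from these definitions, the two outputs differ. Apart from the obvious difference in presentation (NLTK prints a bracketed tree as text, while spaCy's displacy draws arcs between the words), the annotation schemes also differ: the NLTK output groups the POS-tagged tokens into the phrase labels defined in our grammar (NP, VP, NEP), whereas spaCy labels each arc between a word and its head with a dependency relation.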
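
To make the distinction between the two kinds of parse concrete, the sketch below writes both analyses down as plain Python data for the toy sentence "Samsung lost a case". Both analyses are hand-written assumptions for illustration; they are not the output of NLTK or spaCy, and the relation names (`nsubj`, `det`, `dobj`) are only meant to resemble typical dependency labels.

```python
# Constituency: nested phrases, each node is (label, child, child, ...).
constituency = ('S',
                ('NP', 'Samsung'),
                ('VP', ('V', 'lost'), ('NP', 'a', 'case')))

# Dependency: one (dependent, relation, head) arc per word; the root has no head.
dependencies = [
    ('Samsung', 'nsubj', 'lost'),
    ('lost', 'ROOT', None),
    ('a', 'det', 'case'),
    ('case', 'dobj', 'lost'),
]


def leaves(node):
    """Return the words of a constituency tree from left to right."""
    if isinstance(node, str):
        return [node]
    _label, *children = node
    return [word for child in children for word in leaves(child)]


def phrase_labels(node):
    """Return the labels of all phrase nodes in a constituency tree."""
    if isinstance(node, str):
        return []
    label, *children = node
    return [label] + [lab for child in children for lab in phrase_labels(child)]


words = leaves(constituency)
labels = phrase_labels(constituency)
print('words:', words)
print('phrase nodes:', labels)
print('dependency arcs:', len(dependencies))
```

The constituency analysis introduces extra phrase nodes on top of the words, while the dependency analysis has exactly one arc per word and no nodes other than the words themselves.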