-------------------
#### Extracting Noun Chunks using spaCy
---------------------

In [19]:
import spacy
from prettytable import PrettyTable

In [20]:
nlp = spacy.load('en_core_web_sm')

In [21]:
def read_text_file(filename):
    # The context manager closes the file even if reading fails
    with open(filename, "r", encoding="utf-8") as file:
        return file.read()

In [22]:
location = r'D:\AI-DATASETS\01-MISC\Sherlock-Holmes.txt'

In [23]:
text = read_text_file(location)

Process the text with the spaCy pipeline loaded above:

In [24]:
doc = nlp(text)

The noun chunks are available through the `doc.noun_chunks` property, which yields one `Span` per chunk. We can collect the chunk texts in a list and count them:

In [25]:
all_noun_chunks = []

for noun_chunk in doc.noun_chunks:
    all_noun_chunks.append(noun_chunk.text)
    
len(all_noun_chunks)

30566

#### How it works…
- The spaCy `Doc` object stores the dependency parse, that is, the grammatical relationships between the words in each sentence.
- Using this information, spaCy determines the noun phrases (`chunks`) contained in the text.
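
To make the idea concrete, here is a simplified, pure-Python sketch of how noun chunks can be read off a dependency parse. It is not spaCy's implementation: the `Tok` class, the hand-written parse of a shortened sentence, and the label sets `NOUN_POS` and `NP_DEPS` are all assumptions chosen for illustration. Each noun that acts as a subject or object is taken as the end of a chunk, and the chunk starts at the leftmost token that depends on it.

```python
from dataclasses import dataclass


@dataclass
class Tok:
    text: str
    pos: str   # coarse part-of-speech tag
    dep: str   # dependency label
    head: int  # index of the syntactic head (the sentence root points to itself)


# Hand-written parse of a shortened sentence (an assumption, not spaCy output).
TOKENS = [
    Tok("All", "DET", "det", 1),
    Tok("emotions", "NOUN", "nsubj", 2),
    Tok("were", "AUX", "ROOT", 2),
    Tok("abhorrent", "ADJ", "acomp", 2),
    Tok("to", "ADP", "prep", 3),
    Tok("his", "PRON", "poss", 7),
    Tok("cold", "ADJ", "amod", 7),
    Tok("mind", "NOUN", "pobj", 4),
    Tok(".", "PUNCT", "punct", 2),
]

# Illustrative label sets; a real parser uses a longer list.
NOUN_POS = {"NOUN", "PROPN", "PRON"}
NP_DEPS = {"nsubj", "nsubjpass", "dobj", "pobj", "attr"}


def left_edge(tokens, i):
    """Index of the leftmost token in the subtree rooted at token i."""
    edge = i
    for j, tok in enumerate(tokens):
        if tok.head == i and j != i:
            edge = min(edge, left_edge(tokens, j))
    return edge


def toy_noun_chunks(tokens):
    """Return (start, end, text) triples, with end exclusive."""
    chunks = []
    for i, tok in enumerate(tokens):
        if tok.pos in NOUN_POS and tok.dep in NP_DEPS:
            start = left_edge(tokens, i)
            text = " ".join(t.text for t in tokens[start:i + 1])
            chunks.append((start, i + 1, text))
    return chunks


for chunk in toy_noun_chunks(TOKENS):
    print(chunk)
```

Note that `his` is tagged as a pronoun but carries the `poss` label, so it does not start a chunk of its own and is absorbed into the chunk ending in `mind`.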

In [26]:
sentence = "All emotions, and that one particularly, were abhorrent to his cold, precise but admirably balanced mind."

In [27]:
doc = nlp(sentence)

In [28]:
for noun_chunk in doc.noun_chunks:
    print(noun_chunk.text)

All emotions
his cold, precise but admirably balanced mind


In [29]:
# Specify the Column Names while initializing the Table
nounTable = PrettyTable(["Token", "Noun chunk start", "Noun chunk end"])
nounTable.align="l"

for noun_chunk in doc.noun_chunks:
    nounTable.add_row([noun_chunk.text, noun_chunk.start, noun_chunk.end])

In [30]:
print(nounTable)

+-----------------------------------------------+------------------+----------------+
| Token                                         | Noun chunk start | Noun chunk end |
+-----------------------------------------------+------------------+----------------+
| All emotions                                  | 0                | 2              |
| his cold, precise but admirably balanced mind | 11               | 19             |
+-----------------------------------------------+------------------+----------------+
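
The start and end values are token indices, not character offsets, and the end index is exclusive, so in the notebook `doc[noun_chunk.start:noun_chunk.end]` gives back the chunk. The sketch below shows the same slicing on a hand-written token list that stands in for spaCy's tokenization of the sentence (an assumption, so that it runs without a model); `chunk_tokens` is a hypothetical helper.

```python
# Hand-written tokens standing in for spaCy's tokenization (an assumption).
tokens = ["All", "emotions", ",", "and", "that", "one", "particularly", ",",
          "were", "abhorrent", "to", "his", "cold", ",", "precise", "but",
          "admirably", "balanced", "mind", "."]


def chunk_tokens(tokens, start, end):
    """Return the tokens covered by a chunk; `end` is exclusive."""
    return tokens[start:end]


print(chunk_tokens(tokens, 0, 2))    # tokens 0 and 1
print(chunk_tokens(tokens, 11, 19))  # tokens 11 through 18
```

Punctuation counts as a token, which is why the second chunk spans eight tokens even though it contains only seven words.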


Just like a sentence, every noun chunk has a `root`: the token on which all other tokens in the chunk depend, directly or indirectly.

In a noun phrase, the root is the head `noun`:

In [31]:
# Specify the Column Names while initializing the Table
noun_root_Table = PrettyTable(["Token", "Root"])

for noun_chunk in doc.noun_chunks:
    
    noun_root_Table.add_row([noun_chunk.text, noun_chunk.root.text])

print(noun_root_Table)

+-----------------------------------------------+----------+
|                     Token                     |   Root   |
+-----------------------------------------------+----------+
|                  All emotions                 | emotions |
| his cold, precise but admirably balanced mind |   mind   |
+-----------------------------------------------+----------+
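
One way to picture the root is as the only token in the chunk whose syntactic head lies outside the chunk. The following sketch applies that rule to a hand-written fragment; the `(text, head_index)` pairs and the `span_root` helper are hypothetical and are not taken from spaCy.

```python
# Hypothetical (text, head_index) pairs for the fragment "to his cold mind".
# Indices are positions in this short list; "to" points to itself because
# it is the top of the fragment.
tokens = [("to", 0), ("his", 3), ("cold", 3), ("mind", 0)]


def span_root(tokens, start, end):
    """Return the index of the token in [start, end) whose head is outside the span."""
    for i in range(start, end):
        head = tokens[i][1]
        if head == i or not (start <= head < end):
            return i
    raise ValueError("span has no root; head indices are inconsistent")


# The chunk "his cold mind" covers indices 1 to 3 (end index 4 is exclusive).
root_index = span_root(tokens, 1, 4)
print(tokens[root_index][0])
```

Here `his` and `cold` both attach to `mind`, while `mind` attaches to `to`, which sits outside the chunk, so `mind` is reported as the root.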


#### Similarity

In [32]:
other_span = "emotions"
other_doc = nlp(other_span)

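We can now compare this span to each noun chunk in the sentence. Note that `en_core_web_sm` ships without static word vectors, so the scores below are computed from context-sensitive tensors and should be read as rough indications only: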

In [33]:
# Specify the Column Names while initializing the Table
sim_Table = PrettyTable(["Token to compare", "Noun chunk", "similarity score"])

for noun_chunk in doc.noun_chunks:
    
    sim_Table.add_row([other_span, noun_chunk.text, noun_chunk.similarity(other_doc)])
    
print(sim_Table)

+------------------+-----------------------------------------------+----------------------+
| Token to compare |                   Noun chunk                  |   similarity score   |
+------------------+-----------------------------------------------+----------------------+
|     emotions     |                  All emotions                 |  0.4026422809451551  |
|     emotions     | his cold, precise but admirably balanced mind | -0.03689126143699988 |
+------------------+-----------------------------------------------+----------------------+
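
By default, the `similarity` method estimates similarity as the cosine of the angle between the two objects' vectors, which is why a score can be slightly negative, as in the second row. A minimal sketch of that measure follows, using made-up two-dimensional vectors rather than anything produced by spaCy; `cosine_similarity` is a hypothetical helper written for this illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up vectors (an assumption), chosen to show the range of the measure.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Scores near 1 mean the vectors point in nearly the same direction, scores near 0 mean they are unrelated, and negative scores mean they point in roughly opposite directions.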


