___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [2]:
# Enter your code here:
with open('../TextFiles/owlcreek.txt') as f:
    doc = nlp(f.read())

In [3]:
# Run this cell to verify it worked:
doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I.

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [4]:
len(doc)

4835

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [5]:
sents = list(doc.sents)
print(len(sents))

205


**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [6]:
print(sents[1])

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  


**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
CHALLENGE: Have values line up in columns in the print output.

In [7]:
# CHALLENGE SOLUTION:
for token in sents[1]:
    print(f'{token.text:>{10}} {token.pos_:>{10}} {token.dep_:>{10}} {token.lemma_:>{10}}')

         A        DET        det          a
       man       NOUN      nsubj        man
     stood       VERB       ROOT      stand
      upon      SCONJ       prep       upon
         a        DET        det          a
  railroad       NOUN   compound   railroad
    bridge       NOUN       pobj     bridge
        in        ADP       prep         in
  northern        ADJ       amod   northern
   Alabama      PROPN       pobj    Alabama
         ,      PUNCT      punct          ,
   looking       VERB      advcl       look
      down        ADV     advmod       down
         
      SPACE        dep          

      into        ADP       prep       into
       the        DET        det        the
     swift        ADJ       amod      swift
     water       NOUN       pobj      water
    twenty        NUM     nummod     twenty
      feet       NOUN   npadvmod       foot
     below        ADV     advmod      below
         .      PUNCT      punct          .
                SPACE        dep
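
The two `SPACE` rows above break the column layout because the token text (and lemma) is a literal line break. A minimal sketch of one way around this, assuming it is acceptable to show whitespace in escaped form: pass whitespace-only fields through `repr()` before aligning them. The helper name `format_row` and the sample tuples are hypothetical stand-ins for `token.text`, `token.pos_`, `token.dep_` and `token.lemma_`.

```python
def format_row(text, pos, dep, lemma, width=10):
    """Right-align four fields; whitespace-only text/lemma is shown escaped."""
    def show(s):
        # repr() turns '\n' into a visible escape sequence; [1:-1] strips the quotes
        return repr(s)[1:-1] if s.isspace() else s
    return f'{show(text):>{width}} {pos:>{width}} {dep:>{width}} {show(lemma):>{width}}'

# Sample rows standing in for spaCy token attributes
rows = [
    ('down', 'ADV', 'advmod', 'down'),
    ('\n', 'SPACE', 'dep', '\n'),
    ('into', 'ADP', 'prep', 'into'),
]
for row in rows:
    print(format_row(*row))
```

In the notebook this would be called as `format_row(token.text, token.pos_, token.dep_, token.lemma_)` inside the loop over `sents[1]`.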

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [10]:
# Import the Matcher library:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [13]:
# Create a pattern and add it to matcher:
pattern = [{'LOWER':'swimming'}, {'IS_SPACE': True, 'OP':'*'}, {'LOWER':'vigorously'}]

# Add the pattern to the matcher under the key 'Prac'
# (the exercise suggests 'Swimming'; the outputs below use 'Prac')
matcher.add('Prac', [pattern])

In [12]:
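# Optional reset: remove the 'Prac' pattern before re-adding a revised one.
# Skip this cell when running the notebook top to bottom; otherwise the
# matcher is empty and the next cell finds no matches.
matcher.remove('Prac')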

In [14]:
# Create a list of matches called "found_matches" and print the list:
found_matches = matcher(doc)
print(found_matches)

[(11930998831458770932, 1274, 1277), (11930998831458770932, 3609, 3612)]


In [15]:
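# Collect the end token index of each match; used below to locate the containing sentence
sent_idx = []
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # look up the pattern name from its hash
    span = doc[start:end]                    # matched tokens, including the line-break token
    sent_idx.append(end)
    print(f'{string_id:{5}} {start:{5}} {end:{5}} {span.text:{20}}')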

Prac   1274  1277 swimming
vigorously 
Prac   3609  3612 swimming
vigorously 
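
The span text wraps onto a second line above because it includes the line-break token that sits between the two words. One possible way to print each match on a single row, sketched here with a hypothetical helper `one_line`, is to collapse every run of whitespace to a single space before formatting. The tuples reuse the values printed above; in the notebook the text would come from `span.text`.

```python
def one_line(text):
    """Collapse runs of whitespace (including line breaks) to single spaces."""
    return ' '.join(text.split())

# (name, start, end, span text) as printed above; the text stands in for span.text
matches = [
    ('Prac', 1274, 1277, 'swimming\nvigorously'),
    ('Prac', 3609, 3612, 'swimming\nvigorously'),
]
for name, start, end, text in matches:
    print(f'{name:5} {start:5} {end:5} {one_line(text):20}')
```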


In [16]:
for sent in sents:
    print(sent.start, sent.end)

0 13
13 36
36 54
54 63
63 88
88 138
138 161
161 167
167 233
233 274
274 309
309 318
318 366
366 423
423 453
453 472
472 484
484 505
505 527
527 557
557 574
574 594
594 616
616 661
661 705
705 713
713 735
735 761
761 787
787 821
821 840
840 866
866 895
895 909
909 922
922 953
953 973
973 981
981 987
987 1006
1006 1049
1049 1061
1061 1109
1109 1128
1128 1146
1146 1164
1164 1179
1179 1193
1193 1212
1212 1224
1224 1237
1237 1265
1265 1292
1292 1321
1321 1360
1360 1366
1366 1388
1388 1417
1417 1488
1488 1507
1507 1515
1515 1590
1590 1630
1630 1647
1647 1671
1671 1695
1695 1718
1718 1756
1756 1761
1761 1775
1775 1779
1779 1784
1784 1826
1826 1868
1868 1875
1875 1888
1888 1918
1918 1929
1929 1944
1944 1959
1959 1985
1985 1993
1993 2015
2015 2051
2051 2074
2074 2097
2097 2113
2113 2134
2134 2143
2143 2168
2168 2174
2174 2211
2211 2252
2252 2277
2277 2303
2303 2321
2321 2346
2346 2368
2368 2402
2402 2431
2431 2445
2445 2475
2475 2502
2502 2513
2513 2522
2522 2525
2525 2551
2551 2576
2576 2597
...

In [17]:
sent_idx

[1277, 3612]

**7. Print the text surrounding each found match**

In [18]:
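# Print the sentence containing each match. Work on a copy so that sent_idx
# survives and the cell can be re-run. Match end indices are exclusive, so a
# match that closes a sentence has end == sent.end (hence >=).
pending = list(sent_idx)
for sent in sents:
    if not pending:
        break
    if sent.end >= pending[0]:
        print(sent)
        print()
        del pending[0]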

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
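
The cell above answers the question by printing the whole sentence. Another reading of "surrounding text" is a fixed window of tokens on either side of the match. The sketch below assumes that reading and uses a plain list as a stand-in for the `Doc`; `context_window` is a hypothetical helper, and the slice bounds are clamped so that a match near the start cannot produce a negative index.

```python
def context_window(tokens, start, end, width=5):
    """Return tokens[start:end] plus up to `width` tokens on each side."""
    lo = max(start - width, 0)           # clamp: a negative start would wrap around
    hi = min(end + width, len(tokens))   # clamp to the end of the sequence
    return tokens[lo:hi]

# Stand-in token list taken from the first matching sentence
tokens = ['By', 'diving', 'I', 'could', 'evade', 'the', 'bullets', 'and', ',',
          'swimming', '\n', 'vigorously', ',', 'reach', 'the', 'bank']
start, end = 9, 12   # position of 'swimming', '\n', 'vigorously' in this list

print(context_window(tokens, start, end, width=3))
```

With the notebook's objects the same bounds can be applied to the `Doc` directly, for example `doc[max(start - 5, 0):min(end + 5, len(doc))]`.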



**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [19]:
for sent in sents:
    if found_matches[0][1] < sent.end:
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [20]:
for sent in sents:
    if found_matches[1][1] < sent.end:
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
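
The two cells above repeat the same scan once per match. A sketch of a reusable lookup, assuming the sentences are contiguous and sorted by start index (as the boundary listing earlier suggests): keep the sentence start offsets in a list and use `bisect` to find the sentence whose range contains a given token index. `sentence_index` is a hypothetical helper, and the boundaries are an excerpt of the values printed earlier.

```python
from bisect import bisect_right

def sentence_index(starts, token_i):
    """Return the position of the sentence containing token_i.

    `starts` must hold the sorted start offsets of contiguous sentences.
    """
    return bisect_right(starts, token_i) - 1

# Excerpt of (start, end) pairs from the sentence boundary listing
bounds = [(1224, 1237), (1237, 1265), (1265, 1292), (1292, 1321)]
starts = [s for s, _ in bounds]

match_start = 1274                      # start index of the first match
i = sentence_index(starts, match_start)
print(bounds[i])
```

In the notebook, `starts` would be `[sent.start for sent in sents]` and `sents[i]` the containing sentence. spaCy also exposes a `sent` property on spans, so `doc[start:end].sent` should give the same sentence without a manual search.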


### Great Job!