# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [2]:
# Enter your code here:
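# The file is read from the working directory here; use the path from the
# hint above if the text file lives in ../TextFiles/ instead.
with open('owlcreek.txt') as f:
    doc = nlp(f.read())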

In [3]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [4]:
len(doc)

4835

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [5]:
# doc.sents is a generator, so materialize it as a list before counting or indexing:
sentence_list = list(doc.sents)

len(sentence_list)

204

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [6]:
sentence_list[1]

The man's hands were behind
his back, the wrists bound with a cord.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`.**<br>
**CHALLENGE: Have values line up in columns in the print output.**

In [7]:
# NORMAL SOLUTION:
for token in sentence_list[1]:
    print(token.text, token.pos_, token.dep_, token.lemma_)

The DET det the
man NOUN poss man
's PART case 's
hands NOUN nsubj hand
were AUX ROOT be
behind ADP prep behind

 SPACE dep 

his PRON poss his
back NOUN pobj back
, PUNCT punct ,
the DET det the
wrists NOUN appos wrist
bound VERB acl bind
with ADP prep with
a DET det a
cord NOUN pobj cord
. PUNCT punct .
  SPACE dep  


In [8]:
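# CHALLENGE SOLUTION:
# The number after each colon is a minimum field width. token.lemma is the integer
# hash of the lemma (left-aligned with '<'); token.lemma_ is its string form.
for token in sentence_list[1]:
    print(f'{token.text:{12}} {token.pos_:{6}} {token.dep_:{8}} {token.lemma:<{22}} {token.lemma_}')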


The          DET    det      7425985699627899538    the
man          NOUN   poss     3104811030673030468    man
's           PART   case     16428057658620181782   's
hands        NOUN   nsubj    10690717480206833971   hand
were         AUX    ROOT     10382539506755952630   be
behind       ADP    prep     9368086581607646285    behind

            SPACE  dep      962983613142996970     

his          PRON   poss     2661093235354845946    his
back         NOUN   pobj     15255859468896132977   back
,            PUNCT  punct    2593208677638477497    ,
the          DET    det      7425985699627899538    the
wrists       NOUN   appos    40049004327531306      wrist
bound        VERB   acl      16578919470474021089   bind
with         ADP    prep     12510949447758279278   with
a            DET    det      11901859001352538922   a
cord         NOUN   pobj     8978229881582480158    cord
.            PUNCT  punct    12646065887601541794   .
             SPACE  dep      8532415787641010193
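
The column layout above comes from the width values in the f-string format specs. As a minimal sketch that runs without spaCy, the rows below are hypothetical stand-ins for `(token.text, token.pos_, token.dep_, token.lemma_)`, and `format_row` is an illustrative helper rather than part of the assessment:

```python
# Hypothetical rows standing in for (text, pos_, dep_, lemma_) of a few tokens.
rows = [
    ("The", "DET", "det", "the"),
    ("hands", "NOUN", "nsubj", "hand"),
    ("were", "AUX", "ROOT", "be"),
]

def format_row(text, pos, dep, lemma):
    # Strings are left-aligned by default; the number is the minimum field width.
    return f"{text:12} {pos:6} {dep:8} {lemma}"

for row in rows:
    print(format_row(*row))
```

Integers are right-aligned by default, which is why the solution above uses an explicit `<` for the numeric lemma hash.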

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [9]:
# Import the Matcher library:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [10]:
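# Create a pattern and add it to matcher:
# The IS_SPACE token covers the line break between the two words in the text file.
# Note: the pattern is registered as 'SwimmingVigorously' rather than the 'Swimming'
# suggested in the prompt; the name only labels the match ID.
pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True}, {'LOWER': 'vigorously'}]
matcher.add('SwimmingVigorously', [pattern])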

In [11]:
# Create a list of matches called "found_matches" and print the list:
found_matches = matcher(doc)
print(found_matches)

[(13245044497498710760, 1274, 1277), (13245044497498710760, 3609, 3612)]
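
Each tuple has the form `(match_id, start, end)`: the hash of the pattern name followed by token offsets, with `end` exclusive. The sketch below decodes such tuples on a short made-up text. It assumes a blank English pipeline (`spacy.blank('en')`) is sufficient because the pattern only uses lexical attributes, and it adds `'OP': '*'` to the whitespace token as a variation that is not part of the original pattern:

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough here, because LOWER and
# IS_SPACE are lexical attributes that need no trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Variation on the assessment pattern: 'OP': '*' makes the whitespace token optional.
pattern = [{"LOWER": "swimming"}, {"IS_SPACE": True, "OP": "*"}, {"LOWER": "vigorously"}]
matcher.add("SwimmingVigorously", [pattern])

# Made-up sample text: one occurrence split across a line break, one on a single line.
doc = nlp("He was now swimming\nvigorously. Later he kept swimming vigorously.")

results = []
for match_id, start, end in matcher(doc):
    name = nlp.vocab.strings[match_id]  # look up the pattern name from its hash
    span = doc[start:end]               # token slice; end is exclusive
    results.append((name, start, end, span.text))
    print(name, start, end, repr(span.text))
```

Making the whitespace token optional lets the same pattern match whether or not a line break falls between the two words.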


**7. Print the text surrounding each found match**

In [12]:
# Find the first sentence that ends after the first match begins:
sents = list(doc.sents)
for sent in sents:
    if found_matches[0][1] < sent.end:
        found_sent_match = sent
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [13]:
# Repeat for the second match:
for sent in sents:
    if found_matches[1][1] < sent.end:
        found_sent_match2 = sent
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
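
"Surrounding text" can also be read as a fixed window of tokens on either side of each match. A minimal sketch of that reading follows; it uses a plain list of strings as a stand-in for the Doc (a spaCy `Doc` accepts the same slice syntax), and both `context_window` and the `width` parameter are hypothetical names introduced for illustration:

```python
def context_window(tokens, start, end, width=3):
    # Clamp the left edge at 0: a negative slice start would count from the end.
    left = max(start - width, 0)
    right = min(end + width, len(tokens))
    return tokens[left:right]

# Hypothetical token list standing in for a Doc.
tokens = ["he", "was", "now", "swimming", "\n", "vigorously",
          "with", "the", "current", "."]
# Hypothetical match tuple in the (match_id, start, end) shape the Matcher returns.
matches = [(0, 3, 6)]

for _, start, end in matches:
    print(context_window(tokens, start, end, width=2))
```

Clamping matters for matches near the start of the text, where `start - width` would otherwise go negative and silently select tokens from the end.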


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [14]:
print(found_sent_match)

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [15]:
print(found_sent_match2)

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
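
As an alternative to looping over `doc.sents`, a matched span can report its own sentence through the `Span.sent` attribute. The sketch below is a toy illustration under stated assumptions: it uses a blank pipeline with the rule-based `sentencizer` instead of `en_core_web_sm`, and the sample text is made up:

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank pipeline plus the rule-based sentencizer provides good
# enough sentence boundaries for this toy text.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
pattern = [{"LOWER": "swimming"}, {"IS_SPACE": True}, {"LOWER": "vigorously"}]
matcher.add("SwimmingVigorously", [pattern])

# Made-up sample text with the phrase split across a line break.
doc = nlp("He looked back. He was now swimming\nvigorously with the current. "
          "The fort was far away.")

# doc[start:end] is the matched Span; .sent is the sentence that contains it.
sentences = [doc[start:end].sent for _, start, end in matcher(doc)]
for sent in sentences:
    print(sent.text)
```

With the full `en_core_web_sm` pipeline the sentence boundaries come from the parser, so the same `.sent` lookup should work without adding a sentencizer.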
