# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [2]:
# Enter your code here:
with open ('sources/owlcreek.txt') as f:
    doc = nlp(f.read())

In [3]:
# Run this cell to verify it worked:
doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [4]:
len(doc)

4835

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [5]:
sents = []
for sentence in doc.sents:
    sents.append(sentence)
len(sents)

229

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [6]:
sents[2]

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

** 5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`<br>
CHALLENGE: Have values line up in columns in the print output.**

In [7]:
# NORMAL SOLUTION:
for token in sents[2]:
    print(token.text, token.pos, token.dep, token.lemma)

A 90 415 11901859001352538922
man 92 429 3104811030673030468
stood 100 8206900633647566924 16121235759125543490
upon 98 443 12776617025319584140
a 90 415 11901859001352538922
railroad 92 7037928807040764755 11929338562591612190
bridge 92 439 10505406131357236919
in 85 443 3002984154512732771
northern 84 402 14402328224860809449
Alabama 96 439 1974316624015891830
, 97 445 2593208677638477497
looking 100 399 16096726548953279178
down 86 444 6421409113692203669

 103 0 962983613142996970
into 85 443 3278561384161438710
the 90 415 7425985699627899538
swift 84 402 9502497712543975804
water 92 439 7248544922998488549
twenty 93 12837356684637874264 8304598090389628520
feet 92 428 779410287755165804
below 86 400 13516515296229086732
. 97 445 12646065887601541794
  103 0 8532415787641010193


In [8]:
# CHALLENGE SOLUTION:
for token in sents[2]:
    print(f'{token.text:<{10}} {token.pos:{5}} {token.dep:<{20}} {token.lemma:<{22}} {token.lemma_:{10}}')

A             90 415                  11901859001352538922   a         
man           92 429                  3104811030673030468    man       
stood        100 8206900633647566924  16121235759125543490   stand     
upon          98 443                  12776617025319584140   upon      
a             90 415                  11901859001352538922   a         
railroad      92 7037928807040764755  11929338562591612190   railroad  
bridge        92 439                  10505406131357236919   bridge    
in            85 443                  3002984154512732771    in        
northern      84 402                  14402328224860809449   northern  
Alabama       96 439                  1974316624015891830    Alabama   
,             97 445                  2593208677638477497    ,         
looking      100 399                  16096726548953279178   look      
down          86 444                  6421409113692203669    down      

            103 0                    962983613142996970     
  

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [9]:
# Import the Matcher library:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [10]:
# Create a pattern and add it to matcher:
pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True}, {'LOWER': 'vigorously'}]
matcher.add('SwimmingVigorously', None, pattern)

In [11]:
# Create a list of matches called "found_matches" and print the list:
found_matches = matcher(doc)
print(found_matches)

[(13245044497498710760, 1274, 1277), (13245044497498710760, 3609, 3612)]


**7. Print the text surrounding each found match**

In [12]:
match_id, start_match1, end_match1 = found_matches[0]
span = doc[start_match1-9:end_match1+13]             
print(span.text)

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home


In [13]:
match_id, start_match2, end_match2 = found_matches[1]
span = doc[start_match2-7:end_match1+5]             
print(span.text)




**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [14]:
for sent in sents:
    if start_match1<sent.end:
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [15]:
for sent in sents:
    if start_match2<sent.end:
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  


### Great Job!