# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [2]:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [3]:
with open('../TextFiles/owlcreek.txt') as f:
    doc = nlp(f.read())

In [4]:
doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [5]:
len(doc)

4835

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [6]:
# doc.sents is a generator, so build a list before taking its length
len(list(doc.sents))

204

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [7]:
# Index 1 is the second sentence (the title is sentence 0)
list(doc.sents)[1]

The man's hands were behind
his back, the wrists bound with a cord.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
CHALLENGE: Have values line up in columns in the print output.

In [8]:
print(f"{'Text':<15} {'POS':<10} {'Dep':<10} {'Lemma':<15}")

for token in list(doc.sents)[1]:
    print(f"{token.text:<15} {token.pos_:<10} {token.dep_:<10} {token.lemma_:<15}")

Text            POS        Dep        Lemma          
The             DET        det        the            
man             NOUN       poss       man            
's              PART       case       's             
hands           NOUN       nsubj      hand           
were            AUX        ROOT       be             
behind          ADP        prep       behind         

               SPACE      dep        
              
his             PRON       poss       his            
back            NOUN       pobj       back           
,               PUNCT      punct      ,              
the             DET        det        the            
wrists          NOUN       appos      wrist          
bound           VERB       acl        bind           
with            ADP        prep       with           
a               DET        det        a              
cord            NOUN       pobj       cord           
.               PUNCT      punct      .              
                SPACE      d
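
The fixed widths used above (15 and 10) work for this sentence, but a longer token would push its row out of line. As a minimal sketch, the widths can instead be computed from the data. The rows below are hypothetical stand-ins for spaCy token attributes, and `format_table` is an illustrative helper, not part of spaCy.

```python
# Hypothetical rows of (text, POS, dep, lemma), standing in for token attributes.
headers = ("Text", "POS", "Dep", "Lemma")
rows = [
    ("The", "DET", "det", "the"),
    ("man", "NOUN", "poss", "man"),
    ("hands", "NOUN", "nsubj", "hand"),
    ("were", "AUX", "ROOT", "be"),
]

def format_table(headers, rows, gap=2):
    """Return lines with every column left-aligned to its widest entry."""
    # zip(headers, *rows) transposes the table: one tuple per column.
    columns = list(zip(headers, *rows))
    widths = [max(len(cell) for cell in col) + gap for col in columns]
    lines = []
    for row in (headers, *rows):
        line = "".join(f"{cell:<{w}}" for cell, w in zip(row, widths))
        lines.append(line.rstrip())  # drop padding after the last column
    return lines

table_lines = format_table(headers, rows)
print("\n".join(table_lines))
```

Whitespace tokens such as `'\n'` still break the layout when printed as-is, as the output above shows. Formatting the text with the `!r` conversion (for example `f"{token.text!r:<15}"`) is one way to keep each token on a single line.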

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [9]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [10]:
# Create a pattern and add it to matcher:
pattern = [{"LOWER": "swimming"}, {"IS_SPACE": True}, {"LOWER": "vigorously"}]

matcher.add("Swimming", [pattern])




In [11]:
# Create a list of matches called "found_matches" and print the list:
found_matches = matcher(doc)
print(found_matches)


[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]
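
Each match above spans three tokens because both occurrences in the file break across a line, and spaCy keeps the newline as its own whitespace token. A pattern that requires `IS_SPACE` would miss the phrase when it sits on a single line. The sketch below makes the whitespace token optional with `'OP': '?'`. It assumes a blank English pipeline (tokenizer only) is sufficient, since the pattern uses only lexical attributes; the names `nlp_blank`, `matcher_opt` and `find_spans` are illustrative.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough, because LOWER and
# IS_SPACE are lexical attributes that need no trained model.
nlp_blank = spacy.blank("en")

matcher_opt = Matcher(nlp_blank.vocab)
# "OP": "?" makes the whitespace token optional (zero or one occurrence).
pattern_opt = [
    {"LOWER": "swimming"},
    {"IS_SPACE": True, "OP": "?"},
    {"LOWER": "vigorously"},
]
matcher_opt.add("SwimmingOptionalSpace", [pattern_opt])

def find_spans(text):
    """Return (matched text, start, end) for each match in text."""
    doc = nlp_blank(text)
    return [(doc[start:end].text, start, end) for _, start, end in matcher_opt(doc)]

same_line = find_spans("he was now swimming vigorously with the current")
line_break = find_spans("he was now swimming\nvigorously with the current")
print(same_line)
print(line_break)
```

The single-line phrase should match as two tokens and the line-broken phrase as three, so one pattern covers both layouts.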


**7. Print the text surrounding each found match**

In [16]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation

    print('Match ', match_id, string_id)
    # Show 12 tokens of context on each side; clamp the start so a match
    # near the beginning of the document cannot produce a negative index
    print(doc[max(start - 12, 0):end + 12], '\n')

Match  12881893835109366681 Swimming
stream.  By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away 

Match  12881893835109366681 Swimming
hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  His brain was as energetic as his 



**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [100]:
from spacy.matcher import PhraseMatcher

phrase_matcher = PhraseMatcher(nlp.vocab)

phrase_list = ['swimming vigorously', 'swimming\nvigorously']

phrase_patterns = [nlp(phrase) for phrase in phrase_list]

# spaCy v3 expects the patterns as a single list (the v2 form passed None and unpacked them)
phrase_matcher.add('SwimmingV', phrase_patterns)

phrase_matches = phrase_matcher(doc)

# (match_id, start, end)
phrase_matches


[(10778749267747703984, 1274, 1277), (10778749267747703984, 3609, 3612)]

In [101]:
# Build a list of sentences
sents = [sent for sent in doc.sents]

for sent in sents:
    for match_id, start, end in phrase_matches:
        if start >= sent.start and end <= sent.end:
            print(sent)
            print('\n')
            break
    


By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
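
The nested loop above compares every sentence with every match, which is fine for a short story. For longer texts, the same containment test can be done with a binary search over the sentence start offsets. This is a sketch on hypothetical token offsets; in the notebook the bounds would come from `sent.start`, `sent.end` and the matcher results, and `containing_sentence` is an illustrative helper.

```python
from bisect import bisect_right

# Hypothetical token offsets: (start, end) of each sentence in document order,
# and (start, end) of each match.
sentence_bounds = [(0, 10), (10, 25), (25, 40), (40, 60)]
match_bounds = [(12, 15), (44, 47)]

def containing_sentence(sentence_bounds, start, end):
    """Return the index of the sentence that fully contains [start, end), or None."""
    starts = [s for s, _ in sentence_bounds]
    # Index of the last sentence that starts at or before `start`.
    i = bisect_right(starts, start) - 1
    if i >= 0 and end <= sentence_bounds[i][1]:
        return i
    return None

hits = [containing_sentence(sentence_bounds, s, e) for s, e in match_bounds]
print(hits)
```

spaCy also exposes a `sent` attribute on `Span` objects, so `doc[start:end].sent` may be a simpler route when sentence boundaries are already set.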


