# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [2]:
# Enter your code here:
# The Project Gutenberg file is UTF-8, so state the encoding explicitly
with open('../TextFiles/owlcreek.txt', encoding='utf-8') as file:
    text = file.read()

doc = nlp(text)



In [3]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [4]:
print("Number of tokens:", len(doc))

Number of tokens: 4835


**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [5]:

sentences = list(doc.sents)
print("Number of sentences:", len(sentences))

Number of sentences: 204


**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [6]:
sentences[1]

The man's hands were behind
his back, the wrists bound with a cord.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`<br>
CHALLENGE: Have values line up in columns in the print output.**

In [7]:
# NORMAL SOLUTION:
for token in sentences[1]:
    print(f"{token.text}\t{token.pos_}\t{token.dep_}\t{token.lemma_}")

The	DET	det	the
man	NOUN	poss	man
's	PART	case	's
hands	NOUN	nsubj	hand
were	AUX	ROOT	be
behind	ADP	prep	behind

	SPACE	dep	

his	PRON	poss	his
back	NOUN	pobj	back
,	PUNCT	punct	,
the	DET	det	the
wrists	NOUN	appos	wrist
bound	VERB	acl	bind
with	ADP	prep	with
a	DET	det	a
cord	NOUN	pobj	cord
.	PUNCT	punct	.
 	SPACE	dep	 


In [8]:
# CHALLENGE SOLUTION:
col_width = 15  # every column is left-aligned and padded to this width

print(f"{'Text':<{col_width}}{'POS':<{col_width}}{'dep':<{col_width}}{'lemma':<{col_width}}")

for token in sentences[1]:
    print(f"{token.text:<{col_width}}{token.pos_:<{col_width}}{token.dep_:<{col_width}}{token.lemma_:<{col_width}}")

Text           POS            dep            lemma          
The            DET            det            the            
man            NOUN           poss           man            
's             PART           case           's             
hands          NOUN           nsubj          hand           
were           AUX            ROOT           be             
behind         ADP            prep           behind         

              SPACE          dep            
              
his            PRON           poss           his            
back           NOUN           pobj           back           
,              PUNCT          punct          ,              
the            DET            det            the            
wrists         NOUN           appos          wrist          
bound          VERB           acl            bind           
with           ADP            prep           with           
a              DET            det            a              
cord           NOUN           pobj           cord           
.              PUNCT          punct          .              
               SPACE          dep                           
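
The whitespace tokens in this sentence print as raw characters, which is why the table above breaks across lines. One possible workaround is to format `repr(text)` instead of the raw text, so a line break shows up as `'\n'` and every row stays on one line. The sketch below is plain Python: `rows` is a small sample hand-copied from the output above rather than live spaCy output, and `format_row` is a hypothetical helper name.

```python
# Sample rows hand-copied from the output above (not live spaCy output).
rows = [
    ("behind", "ADP", "prep", "behind"),
    ("\n", "SPACE", "dep", "\n"),
    ("his", "PRON", "poss", "his"),
]

COL_WIDTH = 15


def format_row(text, pos, dep, lemma, width=COL_WIDTH):
    """Left-align four fields; repr() escapes whitespace so a row never wraps."""
    return f"{text!r:<{width}}{pos:<{width}}{dep:<{width}}{lemma!r:<{width}}"


table = [format_row(*row) for row in rows]
for line in table:
    print(line)
```

With a real Doc, the same helper could be called as `format_row(token.text, token.pos_, token.dep_, token.lemma_)` for each token in `sentences[1]`.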

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [9]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [10]:
# Create a pattern and add it to matcher:
pattern = [{"LOWER": "swimming"}, {"IS_SPACE": True, "OP": "*"}, {"LOWER": "vigorously"}]
matcher.add("Swimming", [pattern])

matches = matcher(doc)

In [11]:
# Create a list of matches called "found_matches" and print the list:
found_matches = [(doc[start:end], start, end) for match_id, start, end in matches]
print(found_matches)

**7. Print the text surrounding each found match**

In [12]:
for match in found_matches:
    start = max(match[1] - 5, 0)  # up to 5 tokens before the match, never below index 0
    end = match[2] + 5  # up to 5 tokens after the match
    surrounding_text = doc[start:end]
    print(surrounding_text)

evade the bullets and, swimming
vigorously, reach the bank,
shoulder; he was now swimming
vigorously with the current.  
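
The window arithmetic in the cell above can be factored into a small helper that clamps both ends to the document bounds. Clamping matters mostly at the start: in ordinary Python slicing a negative start counts from the end of the sequence, so a match near the beginning could otherwise produce an empty or misleading slice. This is a minimal sketch, not part of the assessment: `context_window` and its `window` parameter are hypothetical names, and a plain list stands in for the Doc.

```python
def context_window(start, end, length, window=5):
    """Return (lo, hi) slice bounds widened by `window`, clamped to [0, length]."""
    lo = max(start - window, 0)
    hi = min(end + window, length)
    return lo, hi


# A plain list of words stands in for the Doc here.
tokens = "he was now swimming vigorously with the current".split()

# The "match" covers tokens 3 and 4; both ends of the window get clamped.
lo, hi = context_window(3, 5, len(tokens))
print(tokens[lo:hi])
```

With the objects from this notebook, the call would look like `context_window(match[1], match[2], len(doc))`, followed by `doc[lo:hi]`.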


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [13]:
for match in found_matches:
    start = match[1]
    end = match[2]
    # sent.end is exclusive, so pick the sentence whose span holds the match's first token
    sentence = next(sent for sent in sentences if sent.start <= start < sent.end)
    print(sentence)

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  
The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
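
Scanning every sentence for each match is fine for a short story. For longer texts, one alternative is a binary search over the sentence start offsets. The sketch below uses the standard-library `bisect` module on plain integers. `sentence_index` and the sample offsets are hypothetical, and the approach assumes the sentences are sorted and cover the document without gaps, which should hold for `doc.sents`.

```python
from bisect import bisect_right


def sentence_index(sent_starts, token_index):
    """Return the position of the sentence whose span contains token_index.

    sent_starts must be sorted in ascending order and begin at 0.
    """
    return bisect_right(sent_starts, token_index) - 1


# Hypothetical start offsets for four sentences in a 40-token document.
sent_starts = [0, 7, 19, 31]

print(sentence_index(sent_starts, 0))   # first token of the first sentence
print(sentence_index(sent_starts, 18))  # last token of the second sentence
print(sentence_index(sent_starts, 19))  # first token of the third sentence
```

In this notebook the offsets could be built once with `[sent.start for sent in sentences]`, after which `sentences[sentence_index(sent_starts, start)]` would give the sentence for a match starting at `start`.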
