
# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [0]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [0]:
# Enter your code here:
with open("owlcreek.txt", encoding="utf-8") as f:
    text = f.read()
doc = nlp(text)

In [0]:
# Run this cell to verify it worked:
doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [0]:
len(doc)

4833

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [60]:
# Naive attempt: splitting on '\n' counts lines, not sentences, because sentences can span several lines
sents = text.split('\n')
len(sents)

376

In [61]:
sents = list(doc.sents)
len(sents)

222
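
The gap between the two counts (376 versus 222) comes from the file being hard-wrapped: `split('\n')` returns lines, blank ones included, while `doc.sents` follows sentence boundaries. Below is a minimal sketch of that difference on a short sample. The line breaks in `sample` and the regex splitter are assumptions made for illustration; the regex is only a rough stand-in for spaCy's sentence segmentation, not how spaCy works.

```python
import re

# Short hard-wrapped sample; the wrapping positions are assumed for illustration.
sample = (
    "A man stood upon a railroad bridge in northern Alabama, looking down\n"
    "into the swift water twenty feet below.  The man's hands were behind\n"
    "his back, the wrists bound with a cord.\n"
)

# Splitting on newlines yields one item per line, plus an empty
# string after the final newline.
lines = sample.split('\n')

# Rough stand-in for sentence segmentation: break after '.', '!' or '?'
# when followed by whitespace.
sentences = [s for s in re.split(r'(?<=[.!?])\s+', sample.strip()) if s]

print(len(lines))      # 4
print(len(sentences))  # 2
```

Even on three lines of text the two counts disagree, which is why the line-based count overshoots on the full story.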

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [0]:
# Index 2 in this run: the front matter before this sentence was split into two sentences
sents[2]

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
CHALLENGE: Have values line up in columns in the print output.

In [24]:
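# NORMAL SOLUTION:
# Note: this loops over the whole doc, so the output starts at the title;
# iterate over sents[2] instead to cover only the sentence above.
# spacy.explain() expands the coarse POS tag (e.g. ADP -> 'adposition').
for token in doc:
    if token.text not in [',',' ','.',';','!','\n','\t','\"','--']:
        print(f'{token.text:{20}} {spacy.explain(token.pos_):{25}} {token.dep_:{20}} {token.lemma_:{20}}')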

AN                   determiner                det                  an                  
OCCURRENCE           noun                      compound             OCCURRENCE          
AT                   adposition                compound             at                  
OWL                  proper noun               compound             OWL                 
CREEK                proper noun               nsubj                CREEK               
BRIDGE               verb                      ROOT                 BRIDGE              


                   space                                          

                  
by                   adposition                prep                 by                  
Ambrose              proper noun               compound             Ambrose             
Bierce               proper noun               pobj                 Bierce              


                   space                                          

                  
I                    
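
The broken rows in the output above come from whitespace tokens such as `'\n\n'`, which are not in the exclusion list and so print their newlines into the table. Below is a minimal sketch of the column layout that skips any all-whitespace text. `format_row` and `rows` are hypothetical names; the sample values are copied from the output above, and plain tuples stand in for spaCy tokens. With real tokens, checking `token.is_space` should serve the same purpose as `str.isspace()` here.

```python
def format_row(text, pos, dep, lemma, widths=(20, 25, 20, 20)):
    """Left-align four fields in fixed-width columns."""
    fields = (text, pos, dep, lemma)
    row = ' '.join(f'{field:<{width}}' for field, width in zip(fields, widths))
    return row.rstrip()

# (text, POS description, dep, lemma) values taken from the output above,
# including one whitespace token that should be skipped.
rows = [
    ('by', 'adposition', 'prep', 'by'),
    ('\n\n', 'space', '', '\n\n'),
    ('Ambrose', 'proper noun', 'compound', 'Ambrose'),
    ('Bierce', 'proper noun', 'pobj', 'Bierce'),
]

# Drop rows whose text is entirely whitespace before formatting.
table = [format_row(*row) for row in rows if not row[0].isspace()]
print('\n'.join(table))
```

Filtering on "is this all whitespace" rather than on a fixed list of strings avoids having to enumerate every newline combination that appears in the file.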

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [0]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [0]:
# Create a pattern and add it to matcher:
pattern1 = [{'LOWER':'swimming'},{'IS_SPACE':True},{'LOWER':'vigorously'}]
#pattern1 = [{'LOWER':'swimming'},{'IS_SPACE':True, 'OP' : '*'},{'LOWER':'vigorously'}]
# spaCy v2 signature; spaCy v3 expects matcher.add('swimming', [pattern1])
matcher.add('swimming',None,pattern1)
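
The two pattern variants differ only in how many whitespace tokens they accept between the words: the active one requires exactly one, while the commented-out one with `'OP': '*'` accepts zero or more. Below is a minimal pure-Python sketch of that difference. `find_phrase` is a hypothetical helper and the token list is made up for illustration; it is a simplified stand-in, not spaCy's `Matcher` implementation.

```python
def find_phrase(tokens, first, second, optional_space=False):
    """Return (start, end) token spans where `first` precedes `second`.

    optional_space=False mimics one required whitespace token;
    optional_space=True mimics zero or more whitespace tokens.
    """
    spans = []
    for start, token in enumerate(tokens):
        if token.lower() != first:
            continue
        pos = start + 1
        # Skip over the run of whitespace-only tokens after the first word.
        while pos < len(tokens) and tokens[pos].isspace():
            pos += 1
        gap = pos - (start + 1)
        gap_ok = True if optional_space else gap == 1
        if gap_ok and pos < len(tokens) and tokens[pos].lower() == second:
            spans.append((start, pos + 1))
    return spans

# Made-up token list: one occurrence split by a newline token, one not.
tokens = ['now', 'swimming', '\n', 'vigorously', 'with', 'Swimming', 'vigorously']

strict = find_phrase(tokens, 'swimming', 'vigorously')
relaxed = find_phrase(tokens, 'swimming', 'vigorously', optional_space=True)
print(strict)   # [(1, 4)]
print(relaxed)  # [(1, 4), (5, 7)]
```

In this text both occurrences are broken across a line, so the strict variant is enough; the relaxed one would also catch the phrase when it sits on a single line.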

In [53]:
# Create a list of matches called "found_matches" and print the list:
found_matches = matcher(doc)
found_matches

[(12526975369366237900, 1274, 1277), (12526975369366237900, 3607, 3610)]

In [54]:
for match_id, start, end in found_matches:
  string_id = nlp.vocab.strings[match_id]
  span = doc[start:end]
  print(match_id,string_id,start,end,span.text)


12526975369366237900 swimming 1274 1277 swimming
vigorously
12526975369366237900 swimming 3607 3610 swimming
vigorously


**7. Print the text surrounding each found match**

In [55]:
# First match: tokens 1274 to 1277
doc[1260:1290]

into the stream.  By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home

In [56]:
# Second match: tokens 3607 to 3610
doc[3600:3620]

over his shoulder; he was now swimming
vigorously with the current.  His brain was as energetic
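
The slice bounds in the two cells above were picked by hand for each match. Below is a minimal sketch of deriving them from the match offsets instead, so one loop covers every match. `context` and `window` are hypothetical names, and a plain list of placeholder strings stands in for the `Doc`; the helper only assumes the sequence supports `len()` and slicing, which the notebook already relies on for `doc`.

```python
def context(seq, start, end, window=10):
    """Return seq[start:end] widened by `window` items on each side,
    clamped to the bounds of the sequence."""
    left = max(0, start - window)
    right = min(len(seq), end + window)
    return seq[left:right]

# Placeholder tokens and match offsets, made up for illustration.
tokens = [f't{i}' for i in range(30)]
matches = [(2, 5), (26, 29)]

for start, end in matches:
    print(context(tokens, start, end, window=4))
```

With the notebook's objects this would presumably be called as `context(doc, start, end)` inside a loop over `found_matches`, unpacking each tuple as `match_id, start, end`.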

**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [57]:
found_matches

[(12526975369366237900, 1274, 1277), (12526975369366237900, 3607, 3610)]

In [62]:
# Sentences are in document order, so the first one ending after the match start contains it
for sent in sents:
    if found_matches[0][1] < sent.end:
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [63]:
# Same search for the second match
for sent in sents:
    if found_matches[1][1] < sent.end:
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
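
The two cells above repeat the same search with a different match index. Below is a minimal sketch that handles all matches in one pass by checking which sentence's token range contains the match start. `Sent`, `containing_sentence`, and the boundary numbers are hypothetical stand-ins for illustration; the only assumption about real sentence spans is that they expose `start` and `end`, as `sent.end` is used above. spaCy spans also have a `sent` attribute, so `doc[start:end].sent` may be a shorter route.

```python
from collections import namedtuple

# Stand-in for a sentence span: token offsets plus its text.
Sent = namedtuple('Sent', ['start', 'end', 'text'])

def containing_sentence(sentences, token_index):
    """Return the sentence whose [start, end) range holds token_index, or None."""
    for sent in sentences:
        if sent.start <= token_index < sent.end:
            return sent
    return None

# Made-up sentence boundaries and match offsets for illustration.
sentences = [Sent(0, 12, 'first'), Sent(12, 30, 'second'), Sent(30, 41, 'third')]
matches = [(15, 18), (33, 36)]

hits = [containing_sentence(sentences, start) for start, _ in matches]
for hit in hits:
    print(hit.text)
```

Checking both ends of the range makes the lookup independent of sentence order, at the cost of one extra comparison per sentence.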
