<a href="https://colab.research.google.com/github/motabha1/NLP-Homework/blob/assignment-1/NLP_HW1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Phrase Extraction from a paragraph**

The approach here is divided into two steps:


1.   For each sentence in the paragraph, we generate its constituency parse tree.
2.   We then collect the leaves of every subtree labeled Verb Phrase (VP), Noun Phrase (NP), or Prepositional Phrase (PP). Joined together, the leaves of each such subtree form one of the phrases we require.
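
As a minimal sketch of the second step, the snippet below skips the parser and starts from a hand-written bracketed parse string, so it runs without a CoreNLP server. The example sentence, the `parse` string, and the helper name `phrases_with_label` are illustrative assumptions and are not part of the homework code.

```python
from nltk import Tree

# Hand-written bracketed parse standing in for real parser output (illustrative only).
parse = (
    "(ROOT (S (NP (DT The) (NN cat)) "
    "(VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))) (. .)))"
)
tree = Tree.fromstring(parse)


def phrases_with_label(tree, label):
    """Return the joined leaves of every subtree whose label equals `label`."""
    return [" ".join(st.leaves()) for st in tree.subtrees() if st.label() == label]


for label in ("NP", "VP", "PP"):
    print(label, phrases_with_label(tree, label))
```

Because `subtrees()` visits nested subtrees as well, a phrase can appear inside a larger one: here "the mat" (NP) sits inside "on the mat" (PP), which in turn sits inside "sat on the mat" (VP).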





In [None]:
# Install Stanza, the Stanford NLP Python package that includes the CoreNLP client

!pip install stanza

In [None]:
# Here we will be using the Stanford CoreNLP client to create parse trees
# To use it in Google Colab, we have to start a background Java CoreNLP process and initialize a client in Python
# This client is then used to generate annotated results from the text input

from nltk import Tree
import stanza
from stanza.server import CoreNLPClient

stanza.install_corenlp()

# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4 GB, and port number 9001
# The 'parse' annotator is not listed here; getParseTrees requests it on each annotate call
client = CoreNLPClient(
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
    memory='4G', 
    endpoint='http://localhost:9001',
    be_quiet=True)
print(client)

In [None]:
# This cell solves the homework problem
# It takes a paragraph as input and stores the phrases extracted from it in the "phrases" list



# This function returns the parse trees for a paragraph
# CoreNLP splits the paragraph into sentences and produces one parse tree per sentence; each tree is added to the trees list
# Phrases are later generated from each of these trees
def getParseTrees(txt):
  trees = []
  output = client.annotate(txt, properties={
      'annotators': 'parse',
      'outputFormat': 'json'
    })
  for sent in output['sentences']:
    tree = Tree.fromstring(sent['parse'])
    trees.append(tree)
  return trees 



# This is the main function that extracts phrases from a parse tree
# The parse tree and a phrase label are passed as arguments
# The code visits every subtree and takes the terminal nodes (leaves) of those subtrees that carry the required label
# The leaves are then joined into a single string, which is the required phrase
def getPhrases(tree, pt):
  phrases = []
  for subtree in tree.subtrees():
    # tree.subtrees() only yields Tree objects, so checking the label is enough
    if subtree.label() == pt:
      phrases.append(' '.join(subtree.leaves()))
  return phrases


text = input()

phrases = []
trees = getParseTrees(text)

# These are the phrase labels we extract from the parse trees produced by CoreNLP
phraseTypes = ["NP", "VP", "PP"]

# For each tree (one per sentence), collect the phrases of every type and add them to the main list
for tree in trees:
  for pt in phraseTypes:
    phrases.extend(getPhrases(tree, pt))

for phrase in phrases:
  print(phrase)
