# Analyzing Sentence Structure

The Natural Language Toolkit (NLTK) is a comprehensive Python library for natural language processing and text analytics.
The NLTK book chapter "Analyzing Sentence Structure" deals with the ambiguity that natural language is famous for. It looks for ways to cope with the fact that a language has an unlimited number of possible sentences, while any program written to analyze their structure and discover their meaning must be finite.

The chapter addresses the following questions:

    How can we use a formal grammar to describe the structure of an unlimited set of sentences?
    How do we represent the structure of sentences using syntax trees?
    How do parsers analyze a sentence and automatically build a syntax tree?
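
As a small illustration of these three questions, the sketch below defines a toy context-free grammar, parses one sentence with it, and counts the resulting trees. The grammar and the sentence follow the well-known "elephant in my pajamas" example from the NLTK book; they are not part of this notebook's data, and "nltk.ChartParser" is only one of several parsers that NLTK offers.

```python
import nltk

# Toy grammar (illustrative only, not derived from ice_man.txt)
grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sentence = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']

# A chart parser returns every tree the grammar licenses for the sentence
parser = nltk.ChartParser(grammar)
trees = list(parser.parse(sentence))

for tree in trees:
    print(tree)
```

The sentence receives more than one tree because the prepositional phrase can attach either to the noun or to the verb, which is the kind of ambiguity the chapter discusses.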


In this notebook, we take a practical approach to these topics. Our tasks are:

    - read the txt file
    - use the NLTK tokenizer to split sentences into words, so that each sentence is a list, e.g. sentence = ['I', 'am', 'cold']
    - use POS tagging to assign tags (verb, noun, adverb) to the words in the list and create the parse trees
    - save (write) the sentence trees

First, the file "ice_man.txt" is opened and read.


In [None]:
with open('ice_man.txt', 'r', encoding='utf-8') as file:
    data = file.read().rstrip()

After that, the NLTK library is imported with the following command:

In [None]:
import nltk
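
The tokenizer and the tagger used later rely on data packages that are installed separately from the library itself. The sketch below only checks whether they can be found locally. The resource names and lookup paths are assumptions based on the commented-out download calls further down in this notebook; newer NLTK releases may expect differently named packages.

```python
import nltk

# Resource names and lookup paths are assumptions (see the lead-in);
# adjust them if your NLTK version reports other names in its error message.
RESOURCES = {
    "punkt": "tokenizers/punkt",
    "averaged_perceptron_tagger": "taggers/averaged_perceptron_tagger",
}

def missing_resources(resources):
    """Return the names of the resources that nltk.data.find cannot locate."""
    missing = []
    for name, path in resources.items():
        try:
            nltk.data.find(path)
        except LookupError:
            missing.append(name)
    return missing

missing = missing_resources(RESOURCES)
print("Missing NLTK resources:", missing)
# Each missing name could then be fetched once with nltk.download(name)
```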


The first task is to use the NLTK tokenizer to split sentences into words, so that each sentence is a list.

## Tokenization
Tokenization is the process of dividing a large body of text into smaller parts called tokens. Tokens are useful for finding patterns and serve as the basis for later steps such as stemming and lemmatization.

The sample text must be available as a single string before it can be tokenized, which is why the file was read into "data" above.
"sent_tokenize()" splits the whole text into sentences. "word_tokenize()" splits the whole text into tokens. A list comprehension that applies "word_tokenize()" to every sentence returned by "sent_tokenize()" splits the text into sentences and then splits these sentences into tokens. You can try out all three commands below.

In [None]:
from nltk.tokenize import word_tokenize
word_tokenize(data)


In [None]:
from nltk.tokenize import sent_tokenize
sent_tokenize(data)

In [None]:
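from nltk.tokenize import sent_tokenize, word_tokenize
# Split the text into sentences, then split each sentence into word tokens
tokenized_sentences = [word_tokenize(sentence) for sentence in sent_tokenize(data)]
print(tokenized_sentences)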

The output is a list that contains each sentence as a list of tokens.
The next task is to use POS tagging to assign tags (verb, noun, adverb, etc.) to the words in the list and to create the corresponding parse trees.

To do so, we tokenize the data again. This time, we use a version of the text from which the punctuation has been removed. In the previous steps, punctuation was needed so that "sent_tokenize()" could find the sentence boundaries.
Then, the functions "pos_tag" and "word_tokenize" are imported. Finally, POS tagging is performed by passing the list of tokens to "pos_tag()".

In [None]:
from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of letters, digits, and underscores, so punctuation is dropped
tokenizer = RegexpTokenizer(r'\w+')
strdata = tokenizer.tokenize(data)

# Join the tokens back into one string so that it can be tokenized again in the next step
data2 = ' '.join(strdata)
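
To see what the punctuation removal does, the same pattern can be tried on a short sentence. The sentence below is made up for illustration and is not taken from ice_man.txt. Note that "\w+" also splits contractions at the apostrophe, which may affect the tags assigned later.

```python
from nltk.tokenize import RegexpTokenizer

# Same pattern as in the cell above: keep only runs of word characters
demo_tokenizer = RegexpTokenizer(r'\w+')

# Hypothetical sample sentence, not part of the notebook's data
sample = "I'm cold, aren't you?"
demo_tokens = demo_tokenizer.tokenize(sample)
print(demo_tokens)  # ['I', 'm', 'cold', 'aren', 't', 'you']

# Joining with spaces gives a punctuation-free string again
demo_text = ' '.join(demo_tokens)
print(demo_text)  # I m cold aren t you
```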

In [None]:
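from nltk import pos_tag
from nltk import word_tokenize

# Tokenize the punctuation-free text (named "tokens" so the RegexpTokenizer object is not overwritten)
tokens = word_tokenize(data2)

# Tag every token with its part of speech
textpos = pos_tag(tokens)
textpos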

This is one way to tokenize the text and assign a tag to each word. The next cell goes one step further: after tagging, it groups the tagged words into phrases (chunks) according to a grammar that is defined manually. *Abbreviations: DT=Determiner, JJ=Adjective, NN=Noun, IN=Preposition; the verb tags all start with V (e.g. VB, VBD)*

In [None]:
# Import required libraries
import nltk
#nltk.download('punkt')
#nltk.download('averaged_perceptron_tagger')
from nltk import pos_tag, word_tokenize, RegexpParser


# Tag every token in the punctuation-free text
tagged = pos_tag(word_tokenize(data2))

# Define the chunk grammar. Later rules can refer to chunk labels created by
# earlier rules; labels are case-sensitive, so the PP rule must use <P>, not <p>.
grammar = """
    NP: {<DT>?<JJ>*<NN>}   # noun phrases
    P: {<IN>}              # prepositions
    V: {<V.*>}             # verbs
    PP: {<P> <NP>}         # prepositional phrases
    VP: {<V> <NP|PP>*}     # verb phrases
    """

# Build a chunk parser from the grammar
chunker = RegexpParser(grammar)

# Chunk the tagged tokens and print the resulting tree
output = chunker.parse(tagged)
print(output)
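
Because the chunker only needs (word, tag) pairs, its behaviour can be checked on a small hand-tagged sentence without running the tagger. The sentence and its tags below are assumptions chosen for illustration. The grammar follows the cell above and uses the chunk label "P" in the PP rule, since chunk labels are case-sensitive.

```python
from nltk import RegexpParser
from nltk.tree import Tree

# Hand-tagged example sentence (hypothetical, not from ice_man.txt)
tagged_demo = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
               ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]

demo_grammar = """
    NP: {<DT>?<JJ>*<NN>}
    P: {<IN>}
    V: {<V.*>}
    PP: {<P> <NP>}
    VP: {<V> <NP|PP>*}
    """

demo_tree = RegexpParser(demo_grammar).parse(tagged_demo)
print(demo_tree)

# Labels of the chunks directly under the root
top_labels = [child.label() for child in demo_tree if isinstance(child, Tree)]

# Leaves of every NP chunk found anywhere in the tree
noun_phrases = [subtree.leaves()
                for subtree in demo_tree.subtrees(lambda t: t.label() == "NP")]
print(top_labels)
print(noun_phrases)
```

Each rule is applied as its own stage, in the order written, which is why the PP and VP rules can build on the P, V, and NP chunks created before them.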

The next step is to draw the sentence trees.
The following cell displays the syntax tree for each line of the file. Because it draws one tree per line, it may take some time.
To accomplish this task, TextBlob is used. TextBlob is a Python (2 and 3) library for processing textual data. It gives access to common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, translation, and more.

First, TextBlob is imported. After that, the txt file is opened and its lines are collected in an (initially empty) list.
Part-of-speech tags can be accessed through the "tags" property. The tagged words are then passed to the "parse()" method of an NLTK chunk parser.
For example, the code below states that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN). IN stands for prepositions and VBD for past-tense verbs.

In [None]:
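# Draw a chunk tree for every line of the file, using TextBlob for tagging
import nltk
from textblob import TextBlob

# The 'rU' mode is no longer accepted by recent Python 3 versions; plain 'r' is enough
with open('ice_man.txt', 'r', encoding='utf-8') as ins:
    lines = [line for line in ins if line.strip()]   # skip empty lines

# Chunk grammar; the PP rule refers to the chunk label P (labels are case-sensitive)
pattern = """NP: {<DT>?<JJ>*<NN>}
             P: {<IN>}
             V: {<V.*>}
             PP: {<P> <NP>}
             VP: {<V> <NP|PP>*}
             """
NPChunker = nltk.RegexpParser(pattern)

for line in lines:
    tagged_line = TextBlob(line).tags    # list of (word, tag) tuples
    result = NPChunker.parse(tagged_line)
    result.draw()                        # opens one window per line; close it to continue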
 

The sentence trees can also be created with NLTK alone. This version is shown below:

In [None]:
from nltk.tree import Tree

# draw() is an instance method, so the tagged tokens are first wrapped in a Tree
t = Tree('S', textpos)
t.draw()


Note: I wasn't able to save the sentence trees because of the error "'Tree' object has no attribute 'write'".
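
One possible workaround, sketched here under the assumption that a plain-text copy of the trees is sufficient: "Tree.pformat()" returns the bracketed string form of a tree, which can be written with ordinary file operations and read back with "Tree.fromstring()". The example tree and the file name "trees.txt" are made up for illustration.

```python
import os
import tempfile
from nltk.tree import Tree

# Small example tree (hypothetical); in the notebook this would be a chunker result
tree = Tree.fromstring("(S (NP I) (VP (V shot) (NP (Det an) (N elephant))))")

# Write the bracketed representation to a text file
out_path = os.path.join(tempfile.mkdtemp(), "trees.txt")
with open(out_path, "w", encoding="utf-8") as out_file:
    out_file.write(tree.pformat())

# Read the file back and rebuild the tree from its bracketed form
with open(out_path, "r", encoding="utf-8") as in_file:
    restored = Tree.fromstring(in_file.read())

print(restored)
```

For chunk trees whose leaves are (word, tag) tuples, the text form shows each leaf as word/tag, so a tree read back this way has plain strings as leaves.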

## Sources: 

    - https://www.nltk.org/book/ch08.html
    - https://textblob.readthedocs.io/en/dev/
    