# Assignment a - Linguistics

## Boise State University NLP - Dr. Kennington

### Instructions and Hints:

* For this assignment, we will be looking at tokenization, morphology, and syntax. 
* This will follow the notebook we did in class, though it will be a bit more work. 
* Answer each question (or, in some cases, follow the command).
* Follow the instructions on the corresponding assignment Trello card for submitting your assignment.

In [1]:
import nltk
import pandas as pd
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer

#### We will be using **[Tamarian](https://www.youtube.com/watch?v=ANvlLcOTy6M)** as our example language: 

In [2]:
sentences = [
    'Sinda his face black his eyes red',
    'Tamak',
    'The river Tamak in winter',
    'Darmok and Jalad at Tanagra',
    'Darmok and Jalad on the ocean',
    'Socath his eyes opened',
    'The beast of Tanagra Usani his army Jakka when the walls fell', # extra credit
    'Picard and Dathan at Eladrel',
    'Marab with sails unfurled',
    'Timba his arms open',
    'Timba at rest'
]

### 1. Tokenize the sentences 

* you will need to make sure everything is lower case
* you will need to represent the sentences as a 2D array of words

In [3]:
sentenceFrame = pd.DataFrame(sentences, columns=['sentence'], dtype=str)
# Lower-case first so that the tokens come out lower case as well
sentenceFrame['lower'] = sentenceFrame['sentence'].str.lower()
# Each row holds a list of tokens, so the column as a whole is a 2D array of words
sentenceFrame['tokenized_sents'] = sentenceFrame['lower'].apply(word_tokenize)
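
The "2D array of words" asked for in this question can also be sketched without pandas. The following is a minimal illustration, assuming the sentences contain no punctuation, so that plain whitespace splitting is enough; `sample_sentences` and `tokenize_lower` are hypothetical names introduced here, not part of the assignment.

```python
# Three of the sentences from the list above, kept short for illustration.
sample_sentences = [
    'Sinda his face black his eyes red',
    'Tamak',
    'Timba at rest',
]


def tokenize_lower(sentence):
    """Lower-case a sentence and split it on whitespace (assumes no punctuation)."""
    return sentence.lower().split()


# A list of lists: one inner list of words per sentence.
tokens_2d = [tokenize_lower(s) for s in sample_sentences]

for row in tokens_2d:
    print(row)
```

For text with punctuation, `word_tokenize` is the safer choice, since `str.split` would leave commas and periods attached to words.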

In [4]:
stemmer = SnowballStemmer("english")
newSents = []

for sentence in sentences:
    sent = sentence.lower()
    print(sent)
    sent = word_tokenize(sent)
    print(sent)
    sent = [stemmer.stem(word) for word in sent]
    newSents.append(sent)

sinda his face black his eyes red
['sinda', 'his', 'face', 'black', 'his', 'eyes', 'red']
tamak
['tamak']
the river tamak in winter
['the', 'river', 'tamak', 'in', 'winter']
darmok and jalad at tanagra
['darmok', 'and', 'jalad', 'at', 'tanagra']
darmok and jalad on the ocean
['darmok', 'and', 'jalad', 'on', 'the', 'ocean']
socath his eyes opened
['socath', 'his', 'eyes', 'opened']
the beast of tanagra usani his army jakka when the walls fell
['the', 'beast', 'of', 'tanagra', 'usani', 'his', 'army', 'jakka', 'when', 'the', 'walls', 'fell']
picard and dathan at eladrel
['picard', 'and', 'dathan', 'at', 'eladrel']
marab with sails unfurled
['marab', 'with', 'sails', 'unfurled']
timba his arms open
['timba', 'his', 'arms', 'open']
timba at rest
['timba', 'at', 'rest']
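
The printed tokens above are shown before stemming, but `newSents` stores the stemmed forms, and those are what the grammar has to list as terminals. A small sketch of what the Snowball stemmer does to a few inflected words from these sentences (the `words` list is chosen here for illustration):

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Inflected forms that occur in the sentences above.
words = ['eyes', 'opened', 'army']

# Map each surface form to its stem.
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Note that a stem is not always a dictionary word: `army` becomes `armi`, which is why `'armi'` appears among the `Noun` terminals in the grammar.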


### 2. Write a grammar that can parse all of the sentences

* Try to write as few grammar rules as possible
* Use recursion where you can
* Use `S` as the start symbol
* All terminals need to be in quotes


In [5]:
tamarian_grammar = nltk.CFG.fromstring("""
 S -> NP VP | NP 
 PP -> P NP
 NP -> Det N | Det N PP | 'I' | ADJ N | Det N ADJ | N | N PP | N NP | N CJ
 VP -> NP V | VP PP | DT DT | P V | P VP
 ADJ -> Adj | Adj ADJP
 ADJP -> ADJ
 CJ -> C NP
 DT -> Det N ADJ
 Det -> 'the' | 'his'
 N -> Noun | Noun ProperNoun | ProperNoun
 ProperNoun -> 'sinda' | 'tamak' | 'darmok' | 'jalad' | 'tanagra' | 'eladrel' | 'marad' | 'picard' | 'socath' | 'dathan' | 'marab' | 'timba' | 'usani' | 'jakka' 
 Noun -> 'river'| 'face' | 'eye' |  'winter' |  'ocean' | 'sail' | 'arm' | 'armi' | 'beast' | 'wall'
 V -> 'rest' | 'unfurl' | 'open' | 'fell'
 P -> 'with' | 'in' | 'at' | 'on' | 'of' | 'when'
 Adj -> 'black' | 'red' 
 C -> 'and' 
 """)

## 3. Show that your grammar parses all of the sentences

* Use a parser that can use a CFG (NLTK has several) 
* Make sure there is a parse tree for each of the sentences

In [6]:
parser = nltk.ChartParser(tamarian_grammar)

for sentence in newSents:
    print()
    print(sentence)
    print()
    for tree in parser.parse(sentence):
        print(tree)
        print()


['sinda', 'his', 'face', 'black', 'his', 'eye', 'red']

(S
  (NP (N (ProperNoun sinda)))
  (VP
    (DT (Det his) (N (Noun face)) (ADJ (Adj black)))
    (DT (Det his) (N (Noun eye)) (ADJ (Adj red)))))


['tamak']

(S (NP (N (ProperNoun tamak))))


['the', 'river', 'tamak', 'in', 'winter']

(S
  (NP
    (Det the)
    (N (Noun river) (ProperNoun tamak))
    (PP (P in) (NP (N (Noun winter))))))


['darmok', 'and', 'jalad', 'at', 'tanagra']

(S
  (NP
    (N (ProperNoun darmok))
    (CJ
      (C and)
      (NP
        (N (ProperNoun jalad))
        (PP (P at) (NP (N (ProperNoun tanagra))))))))


['darmok', 'and', 'jalad', 'on', 'the', 'ocean']

(S
  (NP
    (N (ProperNoun darmok))
    (CJ
      (C and)
      (NP
        (N (ProperNoun jalad))
        (PP (P on) (NP (Det the) (N (Noun ocean))))))))


['socath', 'his', 'eye', 'open']

(S
  (NP (N (ProperNoun socath)))
  (VP (NP (Det his) (N (Noun eye))) (V open)))


['the', 'beast', 'of', 'tanagra', 'usani', 'his', 'armi', 'jakka', 'when', 't

For questions 4-6, just answer in markdown/raw text. No code necessary.

## 4. Does your parser have full coverage?

Yes, my parser has full coverage, at least for the sentences listed: each one receives at least one parse tree. If we assume that those sentences are the entirety of the language, or that the rest of the language follows the same grammatical structure, then we can say the parser has full coverage.
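
Although no code is required here, coverage can be checked mechanically by counting parse trees per sentence. The sketch below uses a deliberately tiny toy grammar (an assumption for illustration, much smaller than `tamarian_grammar`), and `count_parses` is a hypothetical helper name.

```python
import nltk

# Toy grammar, assumed for illustration only.
toy_grammar = nltk.CFG.fromstring("""
 S -> N | N PP
 PP -> P N
 N -> 'tamak' | 'timba' | 'rest'
 P -> 'at'
""")
toy_parser = nltk.ChartParser(toy_grammar)


def count_parses(parser, tokens):
    """Return the number of parse trees; 0 means the sentence is not covered."""
    try:
        return len(list(parser.parse(tokens)))
    except ValueError:
        # Raised when a token is missing from the grammar's terminals.
        return 0


test_sentences = [['tamak'], ['timba', 'at', 'rest'], ['at', 'timba'], ['shaka']]
counts = [count_parses(toy_parser, sent) for sent in test_sentences]

# Sentences with zero parses are the gaps in coverage.
uncovered = [sent for sent, n in zip(test_sentences, counts) if n == 0]

print(counts)
print(uncovered)
```

The same helper could be run over `newSents` with the real parser; full coverage would mean that the `uncovered` list is empty.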

## 5. Does your parser over-generate?

The parser does not over-generate. It might be possible to say that the parser over-generates for the extra credit sentence ('The beast of Tanagra Usani his army Jakka when the walls fell'), but looking at the tree it becomes clear that this sentence is ambiguous.
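
One hedged way to probe for over-generation is to enumerate the strings a grammar licenses and compare them with the attested sentences. The sketch below does this for a small toy grammar (assumed for illustration, not `tamarian_grammar`) using `nltk.parse.generate.generate`; the `attested` set is likewise an assumption made for the example.

```python
import nltk
from nltk.parse.generate import generate

# Toy grammar, assumed for illustration only.
toy_grammar = nltk.CFG.fromstring("""
 S -> N | N PP
 PP -> P N
 N -> 'tamak' | 'timba' | 'rest'
 P -> 'at'
""")

# Every string the toy grammar can produce (it is non-recursive, so this is finite).
generated = {' '.join(words) for words in generate(toy_grammar, depth=5)}

# Strings we actually consider part of the language.
attested = {'tamak', 'timba at rest'}

# Anything generated but not attested is a candidate case of over-generation.
unattested = sorted(generated - attested)

print(len(generated))
print(unattested)
```

For a recursive grammar such as the full one above, the `depth` and `n` arguments of `generate` keep the enumeration finite, so the check only samples the language rather than exhausting it.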

## 6. Which sentences are ambiguous? How do you know?

The extra credit sentence seems ambiguous because it is difficult to determine which noun owns what. For example: is the army called Jakka? We could be sure if sentences always separated on proper nouns; however, we have the one example "The river Tamak..." where a noun is followed by a proper noun. Therefore, ambiguity exists.
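
In parser terms, a sentence is ambiguous under a grammar when it receives more than one parse tree. The sketch below illustrates this with a toy grammar (an assumption for illustration; unlike `tamarian_grammar`, it lets a prepositional phrase attach either to the last noun or to the whole conjunction).

```python
import nltk

# Toy grammar, assumed for illustration only; it differs from tamarian_grammar.
toy_grammar = nltk.CFG.fromstring("""
 S -> NP
 NP -> NP C NP | NP PP | N
 PP -> P NP
 N -> 'darmok' | 'jalad' | 'tanagra'
 C -> 'and'
 P -> 'at'
""")
toy_parser = nltk.ChartParser(toy_grammar)

tokens = ['darmok', 'and', 'jalad', 'at', 'tanagra']
trees = list(toy_parser.parse(tokens))

# More than one tree means the sentence is ambiguous under this grammar.
print(len(trees))
for tree in trees:
    print(tree)
```

Here one tree reads "at tanagra" as modifying only Jalad, and the other reads it as modifying "darmok and jalad" as a unit.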

## 7. Parse this sentence:

* If you wrote your grammar right, this should be covered. If this isn't covered, then you'll need to go back and change your grammar.

In [7]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']

In [8]:
for tree in parser.parse(s):
    print(tree)

(S
  (NP (N (ProperNoun timba)))
  (VP
    (VP
      (DT (Det his) (N (Noun face)) (ADJ (Adj red)))
      (DT (Det his) (N (Noun eye)) (ADJ (Adj black))))
    (PP (P in) (NP (N (Noun winter))))))


## 8. Was your result in Question 7 ambiguous?

* Answer in markdown or raw text, no code necessary

No, the result is unambiguous: the parser returns only one tree.

## 9. How expressive is your language?

* Answer in markdown or raw text, no code necessary

It is hard to judge what is meant by expressive, but much of the language, or at least its ambiguity, hinges on how nouns are put together. All of the sentences parse, and if the rest of the language mostly follows this format, then the grammar should cover most sentences. I am concerned that my grammar folds prepositional phrases into the noun phrase category, but I don't think it affects what the language can express. If our examples were more complex, it might prove to be an issue. Also, because the `ADJ` rule is recursive, a sentence can contain an arbitrarily long string of adjectives. That's pretty expressive.
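
The point about adjectives can be illustrated with just the adjective rules from the grammar, using `ADJ` as the start symbol. This is a sketch under that assumption; the alternating test strings are made up for the example.

```python
import nltk

# Only the adjective rules from the grammar above; ADJ is the start symbol here.
adj_grammar = nltk.CFG.fromstring("""
 ADJ -> Adj | Adj ADJP
 ADJP -> ADJ
 Adj -> 'black' | 'red'
""")
adj_parser = nltk.ChartParser(adj_grammar)

# Count parse trees for adjective strings of increasing length.
parse_counts = {}
for n in range(1, 6):
    tokens = ['black' if i % 2 == 0 else 'red' for i in range(n)]
    parse_counts[n] = len(list(adj_parser.parse(tokens)))

print(parse_counts)
```

Every length parses, which is what the recursion through `ADJP` buys: there is no upper bound on how many adjectives can be chained.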

## 10. Make the grammar more general by treating POS tags as the terminals

In [9]:

tamarian_grammar = nltk.CFG.fromstring("""
 S -> NP VP | NP 
 PP -> P NP
 NP -> Det N | Det N PP | 'I' | ADJ N | Det N ADJ | N | N PP | N NP | N CJ
 VP -> NP V | VP PP | DT DT | P V | P VP
 ADJ -> Adj | Adj ADJP
 ADJP -> ADJ
 DT -> Det N ADJ
 CJ -> C NP
 Det -> 'Det'
 N -> Noun | Noun Pronoun | Pronoun
 Pronoun -> 'ProperNoun' 
 Noun -> 'Noun'
 V -> 'Verb'
 P -> 'Prep' 
 Adj -> 'Adj' 
 C -> 'C'
 """)

## 11. What is your set of POS tags?

* show the list of strings (e.g., ['Adj', ...])



In [10]:
['Det', 'ProperNoun', 'Noun', 'Verb', 'Prep', 'Adj', 'C']

['Det', 'ProperNoun', 'Noun', 'Verb', 'Prep', 'Adj', 'C']
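
One way to keep a hand-written tag list in sync with a grammar is to read the terminals off the productions. The sketch below does this for a small toy POS grammar (assumed for illustration, not the full grammar from question 10); it relies on NLTK representing terminals as plain strings on the right-hand side of a production.

```python
import nltk

# Toy POS-level grammar, assumed for illustration only.
toy_pos_grammar = nltk.CFG.fromstring("""
 S -> NP | NP PP
 PP -> P NP
 NP -> Det N | N
 Det -> 'Det'
 N -> 'Noun' | 'ProperNoun'
 P -> 'Prep'
""")

# Terminals are plain strings; nonterminals are Nonterminal objects.
pos_tags = sorted({
    symbol
    for production in toy_pos_grammar.productions()
    for symbol in production.rhs()
    if isinstance(symbol, str)
})

print(pos_tags)
```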

## 12. Make a list for the POS tags that correspond to the sentence `s` below:

In [11]:
s = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']
p = ['ProperNoun',  'Det', 'Noun', 'Adj', 'Det', 'Noun', 'Adj', 'Prep', 'Noun']
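
Instead of writing the tag list by hand, the tags could be looked up in a small lexicon. This is a minimal sketch, assuming each word has exactly one tag; `lexicon` and `tag_tokens` are hypothetical names, and the entries only cover the words of this one sentence.

```python
# Word-to-tag lexicon covering only the words in the example sentence.
lexicon = {
    'timba': 'ProperNoun',
    'his': 'Det',
    'face': 'Noun',
    'eye': 'Noun',
    'winter': 'Noun',
    'red': 'Adj',
    'black': 'Adj',
    'in': 'Prep',
}


def tag_tokens(tokens, lexicon):
    """Replace each token with its POS tag; raises KeyError for unknown words."""
    return [lexicon[token] for token in tokens]


tokens = ['timba', 'his', 'face', 'red', 'his', 'eye', 'black', 'in', 'winter']
tags = tag_tokens(tokens, lexicon)

print(tags)
```

A dictionary lookup like this breaks down as soon as a word can carry more than one tag (for example `open` as a verb or an adjective), which is where a real POS tagger would be needed.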

## 13. Parse the sentence (represented as POS tags)

In [12]:
parser = nltk.ChartParser(tamarian_grammar)

for tree in parser.parse(p):
        print(tree)
        print()

(S
  (NP (N (Pronoun ProperNoun)))
  (VP
    (VP
      (DT (Det Det) (N (Noun Noun)) (ADJ (Adj Adj)))
      (DT (Det Det) (N (Noun Noun)) (ADJ (Adj Adj))))
    (PP (P Prep) (NP (N (Noun Noun))))))



## Extra Credit! Do all of the above questions again, but add the sentence:

'The beast of Tanagra Usani his army Jakka when the walls fell'

*Done!*