# Jupyter assignment 5: Phrase-structure rules, CALL, POS tagging
## LING-UA 6 

Written by Lucas Champollion - Based in part on questions by Kushal Chattopadhyay

Modified by Mia Jacobsen

*There are no solutions for this notebook, since it is less code-driven and consists mostly of discussion and reflection questions. Please don't hesitate to ask me or your fellow students if you're in doubt! After class 2, I will still make the 'filled' version of this available, so feel free to have a look there.*

At the end of this lesson, you will be able to:

- Identify constituents such as noun phrases and verb phrases, and carry out a simple syntactic analysis of an unknown language
- Identify instances where language technology is likely being used to improve interactive computer-assisted language learning (ICALL) technology
- Recognize lexical ambiguities in "crash blossom" headlines
- Interpret the output of automatic part-of-speech (POS) taggers


In [None]:
# first we install the packages we will need for this exercise
%pip install pandas
%pip install svgling

# 1. Phrase structure rules


After years of searching, you have finally met extraterrestrial life, the Balnians, from a planet they call Balnalilak. Although you are eager to converse with them, you must first figure out how their language works. Can you figure out the phrase structure rules of the alien language? 

In the next few questions, you will apply the model of syntax described in Section 3.5.2 of the textbook (phrase structure grammar) to the Balnian language. Here are helpful hints about the syntax of phrase structure rules (mostly for the non-linguists among you):
- A sentence (S) is composed of a noun phrase (NP --- this is the subject of the sentence), and a verb phrase (VP). Example: 'The reindeer played games' on page 65 of the textbook.
- Noun phrases (NP) are constituents that can be subjects or objects in a sentence. 
- When an adjective (Adj) modifies a noun (N), the two of them together are treated like a noun (N). (An example would be "interesting book". Here "interesting" is an Adj, "book" is a N, and the two of them together are treated like an N.)
- A determiner (Det) and a noun (N) often make up an NP. (Example: "the reindeer" on page 65 of the textbook.)
- When a sentence has an object, its VP consists of a verb (V) and an NP (the object). (Example: "played games" on page 65, it consists of a verb (V) and an object (NP).)
- Different languages differ in their word order. This can be reflected in the way phrase structure rules are put together. For example, in English, the object comes after the verb, but in Japanese, the object comes before the verb. This is reflected in the phrase structure rules in the following way: for English, we have the rule "VP -> V NP", but for Japanese, we have the rule "VP -> NP V". In both cases, the VP consists of the same things, but in a different order.
- While Balnian seems like a human language in many respects, there is no reason to think that it should be exactly like English (or like Japanese) in its word order. I recommend you start with the English phrase structure rules that you can find in the textbook (figure 3.9 in the textbook) and tweak them until they match Balnian. 
- Ignore the difference between V(trans) and V(intrans) that you see in that list. We just treat them both as V to make things simpler.
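
To make the word-order hint concrete, here is a minimal sketch of how changing a single rule changes the order of the words. The names `ENGLISH_RULES`, `JAPANESE_RULES`, and `linearize` are made up for this illustration and do not come from the textbook. The example sentence uses English words in both cases so that only the order differs.

```python
# Each rule maps a parent label to the ordered list of its children.
# Only the VP rule differs between the two rule sets.
ENGLISH_RULES = {"S": ["NP", "VP"], "VP": ["V", "NP"]}
JAPANESE_RULES = {"S": ["NP", "VP"], "VP": ["NP", "V"]}

def linearize(label, parts, rules):
    """Return the words of a constituent in the order the rules dictate.

    `parts` is either a string (the words of a constituent we do not
    analyze further) or a dict from child labels to their own parts.
    """
    if isinstance(parts, str):
        return [parts]
    words = []
    for child in rules[label]:
        words.extend(linearize(child, parts[child], rules))
    return words

# An unordered description of the sentence: who did what to what.
sentence = {"NP": "the student", "VP": {"V": "read", "NP": "a book"}}

english_order = " ".join(linearize("S", sentence, ENGLISH_RULES))
japanese_order = " ".join(linearize("S", sentence, JAPANESE_RULES))
print(english_order)   # the student read a book
print(japanese_order)  # the student a book read
```

The VP contains the same two pieces under both rule sets; only the order in which the rule lists them changes.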

You know that the following sentences are grammatically correct in Balnian, and that their translation is accurate:

1. ucak tak rom cerkulak groktiak tak ("The probe communicates with the tall spaceship")

2. balniak groktiak tak rom cerkulak tak ("The tall Balnian communicates with the spaceship")

3. zorak tak gob cerkulak tak ("The captain commands the spaceship")

4. ucak tak von balnalilak menak tak ("The probe comes from the distant Balnalilak")

Here is a short dictionary that may prove useful in finding phrase structure rules. Run the cell to show it as a table:

In [1]:
import pandas as pd
import numpy as np

a = ["balnalilak", "Balnalilak (name of a planet)", "N"]
b = ["balniak", "Balnian", "N"]
c = ["cerkulak", "spaceship", "N"]
d = ["gob", "command, have control over", "V"]
e = ["groktiak", "tall", "Adj"]
f = ["menak", "distant", "Adj"]
g = ["rom", "communicate with", "V"]
h = ["tak", "the", "Det"]
i = ['von', 'come from', "V"]
j = ["ucak", "probe", "N"]
k = ["zorak", "captain", "N"]

balnianTable = pd.DataFrame(np.array([a,b,c,d,e,f,g,h,i,j,k]), 
                            columns=["Balnian", "English", "part of speech (POS)"])
balnianTable

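|    | Balnian | English | part of speech (POS) |
|---:|---|---|---|
| 0 | balnalilak | Balnalilak (name of a planet) | N |
| 1 | balniak | Balnian | N |
| 2 | cerkulak | spaceship | N |
| 3 | gob | command, have control over | V |
| 4 | groktiak | tall | Adj |
| 5 | menak | distant | Adj |
| 6 | rom | communicate with | V |
| 7 | tak | the | Det |
| 8 | von | come from | V |
| 9 | ucak | probe | N |
| 10 | zorak | captain | N |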
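
If you would like to look up several words at once, the dictionary can also be used programmatically. The following is a minimal sketch that uses a plain Python dict instead of the DataFrame; the names `BALNIAN_POS` and `gloss` are hypothetical helpers introduced only for this illustration.

```python
# Part of speech for each Balnian word, copied from the dictionary above.
BALNIAN_POS = {
    "balnalilak": "N", "balniak": "N", "cerkulak": "N", "gob": "V",
    "groktiak": "Adj", "menak": "Adj", "rom": "V", "tak": "Det",
    "von": "V", "ucak": "N", "zorak": "N",
}

def gloss(sentence):
    """Pair every word of a Balnian sentence with its part of speech."""
    return [(word, BALNIAN_POS[word]) for word in sentence.split()]

# Sentence 3 from the list above.
glossed = gloss("zorak tak gob cerkulak tak")
print(glossed)
```

A word-by-word gloss like this only tells you the category of each word. Deciding which words group together into NPs and VPs is still up to you.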


**Question 1.1 (4 points)** Go through each of the Balnian sentences above and specify all of its noun phrases (NP). For each of the sentences, give its noun phrases as a list of strings. If there are none for a given sentence, specify an empty list, like this: `[]`. The first sentence is already filled out.


In [None]:
NP_list_1 = ["ucak tak","cerkulak groktiak tak"]
NP_list_2 = ...
NP_list_3 = ...
NP_list_4 = ...

**Question 1.2 (3 points)** Now identify the verb phrases (VPs) in the sentences. Since there is only one VP per sentence, give it as a string rather than as a list.

In [None]:
VP_1 = "rom cerkulak groktiak tak"

VP_2 = ...

VP_3 = ...

VP_4 = ...

**Question 1.3 (10 points total)** Now use the labels to complete the syntax trees for your sentences below. Your task is to replace each number with a label taken from the following list: 

`Adj, Det, N, NP, S, V, VP`

(If you have taken classes in syntax, you might be used to different labels, e.g. DP instead of NP, D instead of Det, or TP instead of S. You can use whatever you feel most comfortable with.) <br> Below the following cell is a list of "draw_tree" commands that you can run to look at the trees. I recommend you run these cells once before you add any labels to the trees, and then repeatedly as you add labels. This can help you make sure you add the labels in the right places. The first tree is already filled out.

In [None]:
from svgling import draw_tree # this tells Python how to draw trees

In [None]:
tree_1 = ("S",
          ("NP",
           ("N", "ucak"),
           ("Det", "tak")
          ),
          ("VP",
           ("V", "rom"),
           ("NP",
            ("N",
             ("N", "cerkulak"),
             ("Adj", "groktiak")
            ),
            ("Det", "tak")
           )
          )
         )


In [None]:
# run this cell to see the tree above
draw_tree(tree_1)
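
Besides drawing a tree, you can check that its words, read from left to right, spell out the sentence you meant to analyze. Here is a minimal sketch, assuming the nested-tuple format used in this notebook; `leaves` is a hypothetical helper and not part of svgling. The tree is repeated so that the example runs on its own.

```python
def leaves(tree):
    """Collect the words at the bottom of a nested-tuple tree, left to right."""
    if isinstance(tree, str):
        return [tree]          # a bare string is a word
    label, *children = tree    # the first element is the node label
    words = []
    for child in children:
        words.extend(leaves(child))
    return words

# Same structure as tree_1 above.
example_tree = ("S",
                ("NP", ("N", "ucak"), ("Det", "tak")),
                ("VP",
                 ("V", "rom"),
                 ("NP",
                  ("N", ("N", "cerkulak"), ("Adj", "groktiak")),
                  ("Det", "tak"))))

spelled_out = " ".join(leaves(example_tree))
print(spelled_out)  # ucak tak rom cerkulak groktiak tak
```

If the printed string does not match the original sentence, a word has probably been moved or dropped while editing the tree.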

**Question 1.3 Part 1 (4 points)** Add labels to the tree for Sentence 2 by replacing the numbers.

In [None]:
tree_2 = ("1", 
          ("2",
           ("3",
            ("4", "balniak"),
            ("5", "groktiak")
           ),
           ("6", "tak")
          ),
          ("7",
           ("8", "rom"),
           ("9",
            ("10", "cerkulak"),
            ("11", "tak")
           )
          )
         ) 

In [2]:
# run this cell to see the tree above
draw_tree(tree_2)


**Question 1.3 Part 2 (3 points)** Add labels to the tree for Sentence 3.

In [None]:
tree_3 = ("1", 
          ("2",
           ("3", "zorak"),
           ("4", "tak")
          ),
          ("5",
           ("6", "gob"),
           ("7",
            ("8", "cerkulak"),
            ("9", "tak")
           )
          )
         )

In [None]:
draw_tree(tree_3)

**Question 1.3 Part 3 (3 points)** Add labels to the tree for Sentence 4.

In [None]:
tree_4 = ("1",
          ("2",
           ("3", "ucak"),
           ("4", "tak")
          ),
          ("5",
           ("6", "von"),
           ("7",
            ("8",
             ("9", "balnalilak"),
             ("10", "menak")
            ),
            ("11", "tak")
           )
          )
         )

In [None]:
draw_tree(tree_4)
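
Once a tree is fully labeled, every node that has labeled children corresponds to one phrase structure rule: the node's label goes on the left of the arrow and the labels of its children go on the right. Here is a minimal sketch of that idea on an English tree. The function `phrase_rules` and the particular English analysis are my own illustration and are not taken from the textbook.

```python
def phrase_rules(tree, found=None):
    """List the non-lexical rules used in a nested-tuple tree.

    Nodes whose children are plain words (e.g. ("N", "games")) are
    skipped, since rules ending in a terminal are not of interest here.
    """
    if found is None:
        found = []
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        return found  # lexical node such as ("N", "games")
    rule = f"{label} -> {' '.join(child[0] for child in children)}"
    if rule not in found:
        found.append(rule)
    for child in children:
        phrase_rules(child, found)
    return found

english_tree = ("S",
                ("NP", ("Det", "the"), ("N", "reindeer")),
                ("VP", ("V", "played"), ("NP", ("N", "games"))))

english_rules = phrase_rules(english_tree)
for rule in english_rules:
    print(rule)
```

The sketch assumes that every node either dominates only words or only labeled subtrees, which holds for the trees in this notebook.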

<!-- BEGIN QUESTION -->

**Question 1.4 (10 points)** List all the phrase structure rules for Balnian that you can identify, based on the trees you've completed. Skip rules like `N -> ucak` that end in a terminal (i.e. a word). Add the rules after the colons at the end of the prompts, and add further prompts as needed.

1. First rule: 

2. Second rule: 

3. Third rule:

4. Fourth rule:

Etc. 

<!-- END QUESTION -->

# 2. Computer-assisted language learning: Didi


In this exercise, we'll look at a computer-assisted language learning system, Didi (for "Digital Differentiation"), which is used in German schools to help students study English. Didi can be found at [https://didi.schule/buffet](https://didi.schule/buffet). (Notice that there is no .com or .edu within this URL. The word "Schule" means school in German.) 

Didi is designed for native speakers of German who want to learn English, typically middle school students. Its focus is on digital differentiation, that is, providing an individualized learning experience where materials are automatically sequenced based on the learner model, pedagogical criteria, and linguistic/pedagogical task goals. In the current early version, exercises are provided in a menu arranged by grammar topics and linguistic/pedagogical goals.

To log on to Didi, open the file 'access_codes.csv' and pick a random login. Use the entry starting with 'nyulearner' as the user name and the string directly after it as the password (e.g., user name: nyulearner13, password: 31renrael).
If the login doesn't work, try a different one.

When prompted to accept long passages of legal text and the like, the correct option will be presented as "Einverstanden" ("Agreed") and should appear as a blue button. You may also see "Ja" ("Yes") or "Nein" ("No").

If you need help with German for this part of the homework, feel free to use dictionaries and resources like Google Translate (https://translate.google.com/#de/en/) as needed, or to ask German speakers.

Didi has many kinds of exercises, and not all of them involve language technology to enhance its feedback. For example, some exercises are memory games, or they ask students to reorder words or to select from a dropdown menu. We are interested in exercises where the user can type in whatever they want. <font color=red>For this assignment, **look specifically for fill-in-the-blank exercises**. On Didi, the names of these exercises tend to end with the words "in sentences" or to start with "Sentences" or "FiB" (fill-in-the-blank).</font> 

To get feedback on a fill-in-the-blank item, type your answer in and then either press "Return" or click on the question mark next to the input. To make the feedback go away, click on the thumbs-up or thumbs-down button below "Ist dieses Feedback hilfreich?" ("Is this feedback helpful?")

**Please try to avoid clicking on the big blue and red buttons that say "Zwischenspeichern" (save) and "Aufgabe abschließen" (Finish problem) at the bottom of the page, as this would make it harder to reuse the account for future classes. It's no big deal if you do click on these, just try to avoid it if possible.**

**Instructions for Questions 2.1-2.3** Look for three different fill-in-the-blank exercises on Didi. In each of them, identify one example where you believe that Didi has used language technology to enhance its feedback. You will typically have to come up with input that is faulty but close enough to being correct that it triggers error messages that are based on a linguistic analysis (e.g. after entering "My parents always tells", the system says: "Simple present: He, she, it -- the -s must fit. There is no -s with 'I/you/we/they'") as opposed to a canned text response applied as a fallback case (e.g. "This is not what I am expecting -- please try again"). In each case, your answer should include at least the following items, including rough translations as indicated:

1. the ID and title of the exercise (e.g. "T8.2 Sentences with regular and irregular verbs");
2. the instructions of the example (e.g. "Write down the correct form of the simple past for the following verbs.");
3. the prompt (e.g. "The thief ---- to steal two mobile phones from a store in the city centre");
4. your faulty input (e.g. "tryed");
5. the feedback string that the system gave you (e.g. "When an infinitive ends in 'consonant + y', we change the 'y' to 'i' in the simple past.");
6. an explanation of why you guess the computer gave the explanation it did (e.g. "Usually in English, creating a simple past tense is just a matter of adding '-ed' to the end; however, since 'try' ends in a 'y', you have to change the ending letter to 'i'. This is not an intuitive change for nonnative speakers, and many people have most likely tried guessing 'tryed' instead of 'tried'. My guess is that the computer has recognized 'tryed' as a combination of 'try' and '-ed'.")
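
The difference between feedback based on a linguistic analysis and a canned fallback can be sketched in a few lines of Python. This is only a guess at the general idea and is not Didi's actual implementation. The functions `regular_past` and `feedback` are hypothetical, and the spelling rules are deliberately incomplete (no consonant doubling, no irregular verbs).

```python
VOWELS = "aeiou"

def ends_in_consonant_y(verb):
    """True for verbs like 'try' or 'carry', False for 'play'."""
    return len(verb) >= 2 and verb.endswith("y") and verb[-2] not in VOWELS

def regular_past(infinitive):
    """Simple past of a regular verb under simplified spelling rules."""
    if ends_in_consonant_y(infinitive):
        return infinitive[:-1] + "ied"   # try -> tried
    if infinitive.endswith("e"):
        return infinitive + "d"          # like -> liked
    return infinitive + "ed"             # play -> played

def feedback(infinitive, answer):
    """Return targeted feedback if the error is recognized, else a fallback."""
    if answer == regular_past(infinitive):
        return "Correct."
    # Analysis-based case: the learner attached '-ed' without changing the 'y'.
    if ends_in_consonant_y(infinitive) and answer == infinitive + "ed":
        return ("When an infinitive ends in 'consonant + y', "
                "we change the 'y' to 'i' in the simple past.")
    # Canned fallback: the system cannot tell what went wrong.
    return "This is not what I am expecting -- please try again"

print(feedback("try", "tryed"))
print(feedback("try", "trying"))
```

Input such as "tryed" is close enough to the target that the sketch can name the specific rule, while "trying" only triggers the fallback. This is the kind of contrast to look for in your own examples.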

<!-- BEGIN QUESTION -->

**Question 2.1 (10 points)** Describe your first Didi example here as specified above. Add your answers after the colon in each line.

1. ID and title of exercise:

2. Instructions:

3. Prompt:

4. Faulty input:

5. Feedback string:

6. Guessed explanation:

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.2 (10 points)** Describe your second Didi example here as specified above.

1. ID and title of exercise:

2. Instructions:

3. Prompt:

4. Faulty input:

5. Feedback string:

6. Guessed explanation:

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 2.3 (10 points)** Describe your third Didi example here as specified above.

1. ID and title of exercise:

2. Instructions:

3. Prompt:

4. Faulty input:

5. Feedback string:

6. Guessed explanation:

<!-- END QUESTION -->

# 3. POS tagging


Newspaper headlines use a particular writing style which often results in ambiguity (referred to by the writers of
Language Log as [Crash Blossoms](https://languagelog.ldc.upenn.edu/nll/?cat=118),  based on the headline "Violinist linked to JAL crash blossoms").

According to Jonathan Haber at [Huffington Post](https://www.huffpost.com/entry/british-left-waffles-on-f_b_11803832):

> "British Left Waffles on Falklands" was an actual news headline that appeared in the UK’s Guardian newspaper during the 1982 war between Britain and Argentina over who would control the Falkland Islands.  The story under that title talked about how left-wing political parties in England were pulled between patriotism on one hand and ideological distaste for both war and Conservative Prime Minister Margaret Thatcher on the other.
>
> This tug of war left the Left turning this way and that, unable to decide what to do next, an indecisive state commonly referred to (in Britain anyway) as “waffling.”
>
> Now that the Falkland war is history and the political leaders (both Left and Right) who argued over it retired or dead, it’s easy to read the phrase not as a description of politics surrounding an armed conflict, but rather as a far sillier aftermath involving the British leaving behind their breakfast.
>
> Ambiguity derives from the fact that many (actually most) words have more than one meaning. The most obvious example in the headline we’ve been looking at is the word “waffle” which was originally intended as a verb, but can be read as a noun in the sillier forgetting-breakfast interpretation. But on careful inspection, every word is playing a different role between the two interpretations.
>
> “British,” for example, is serving as an adjective modifying “Left” in the serious interpretation, but as a noun in the silly one.  In contrast, “Left” is a noun in the serious version, but the silly version uses that same word as a verb.  “Falklands” is a noun in both versions, but in one version it refers to a set of Islands (silly), in another to the Falklands War (serious).  Even humble little “on” is being used to mean “about” (serious) vs. “in a specific location” (silly).

**Question 3.1 (3 points)** 
Part-of-speech tagging (POS tagging) is the process of annotating each word in a text with its part of speech: noun, verb, adjective, determiner, preposition, etc. POS taggers are an example of language technology that works quite well. In this question, we're going to try and break the POS tagger at https://parts-of-speech.info/, which is based on the [Stanford University Part-Of-Speech Tagger](http://nlp.stanford.edu/software/tagger.shtml).

(Why? Because breaking things can be a good way to understand how they work.)  

Copy and paste the headline "British left waffles on Falklands" into the text field of the tagger. Click on "POS-tag". Hover over the colored word "waffles" in the output to see what tag the tagger assigns to it. Does the output of the POS tagger correspond to the serious or the silly interpretation of the headline? 

Your answer:

**Question 3.2 (30 points total)** Here are some more ambiguous headlines, via [Language Log's Crash Blossoms collection](https://languagelog.ldc.upenn.edu/nll/?cat=118):

### [Republicans look to safety net programs as deficit balloons](https://www.nytimes.com/2018/10/26/us/politics/medicare-medicaid-social-security-republicans-elections.html)
### [Alaskan-developed satellite technology helps fire managers in COVID-19 era](https://www.ktuu.com/content/news/Alaska-develop-satellite-technology-helps-fire-managers-in-COVID-19-era-571231291.html)
### [Council questions warrant arrests](https://languagelog.ldc.upenn.edu/nll/?p=28568)
### [Second Ave. change orders pressure December completion](http://secondavenuesagas.com/2016/06/27/second-ave-change-order-sagas-press-december-completion/)
### [Indiana poll bears good news for Trump](https://languagelog.ldc.upenn.edu/nll/?p=25526#more-25526)
### [China Nov inflation edges up, but deflation risks dog economy](https://www.reuters.com/article/us-china-economy-inflation-idUSKBN0TS17T20151209#0TvUqezeCEr8Zvb4.97)
### [Paris Attacks Cloud Conversation At Summit Of World Powers](https://www.npr.org/2015/11/15/456120983/paris-attacks-cloud-conversation-at-summit-of-world-powers)
### [New York jets ship toilet rolls to UK](https://languagelog.ldc.upenn.edu/nll/?p=21494)
### [Trump insults rattle rivals, please fans](https://languagelog.ldc.upenn.edu/nll/?p=20936)
### [Missing woman remains found](https://languagelog.ldc.upenn.edu/nll/?cat=118&paged=4)
### [Stella McCartney: 'My parents opened doors and closed minds'](https://languagelog.ldc.upenn.edu/nll/?p=16117)
### [Watch batteries while you wait](https://languagelog.ldc.upenn.edu/nll/?p=15368)
### [EU rules ‘mean children can’t get life-saving cancer drugs’](https://www.euractiv.com/section/science-policymaking/news/eu-rules-mean-children-can-t-get-life-saving-cancer-drugs/)
### [Corn maze cutter stalks fall fun across country](https://languagelog.ldc.upenn.edu/nll/?p=6778)
### [Jury awards \$6.5M in CA case of nozzle thought gun](https://languagelog.ldc.upenn.edu/nll/?p=4551)
### [Analysis: China currency move nails hard landing risk coffin](https://languagelog.ldc.upenn.edu/nll/?p=3903)
### [Chinese cooking fat heads for Holland](https://languagelog.ldc.upenn.edu/nll/?p=3867)
### [Romney wins mask lingering questions about his candidacy](https://languagelog.ldc.upenn.edu/nll/?p=3766)
### [Does Donald Trump support matter?](https://languagelog.ldc.upenn.edu/nll/?p=3746)
### [Iranian TV shows downed US drone](https://languagelog.ldc.upenn.edu/nll/?p=3615)
### [Dog helps lightning strike Redruth mayor.](https://languagelog.ldc.upenn.edu/nll/?p=3531)
### [Missouri: Flood Damage Dwarfs Repair Budget](https://languagelog.ldc.upenn.edu/nll/?p=3433)
### [Transgenic grass skirts regulators](https://languagelog.ldc.upenn.edu/nll/?p=3304)
### [Bishops agree sex abuse rules](https://languagelog.ldc.upenn.edu/nll/?p=3070)
### [Qaddafi Forces Bear Down on Strategic Town as Rebels Flee](https://languagelog.ldc.upenn.edu/nll/?p=3022)
### [Council hires ban bid taxi firm](https://languagelog.ldc.upenn.edu/nll/?p=2572)
### [Ghost fishing lobster traps target of study](https://languagelog.ldc.upenn.edu/nll/?p=2509)
### [RESPA overcharges dead in the Ninth Circuit](https://languagelog.ldc.upenn.edu/nll/?p=2394)
### [Greece fears batter markets again](https://languagelog.ldc.upenn.edu/nll/?p=2285)
### [Number of Lothian patients made ill by drinking rockets](https://languagelog.ldc.upenn.edu/nll/?p=2156)

In this exercise, you are going to try and break the POS tagger. That is, you are going to try and find inputs that make it return an incorrect interpretation. Pick three headlines from this list. (If you can't figure out which ones, it doesn't really matter. Don't overthink the choice. If you have trouble understanding a headline or the POS tagger output for it, feel free to just move to another.) 

Copy and paste each of them in turn into the POS tagger. Does the output of the POS tagger correspond to the serious or the silly interpretation of the headline? How can you tell? For each headline you picked, your answer should include at least the following:

1. the headline itself -- e.g. "British left waffles on Falklands"
2. a brief but unambiguous paraphrase of the serious interpretation -- e.g. "Britain's political left is indecisive about the Falklands war" 
3. a brief but unambiguous paraphrase of the silly interpretation -- e.g. "The British have left breakfast food behind on the Falkland islands"
4. a single word indicating whether the tagger gets the serious or the silly interpretation -- e.g. "serious" or "silly"; if the tagger fails completely (e.g. tagging everything as a noun), just say "fail" and explain in the next point
5. a brief statement in the format "because this rather than that" explaining how you were able to tell which interpretation the tagger got -- e.g. "because the tagger incorrectly tagged 'waffles' as a noun rather than a verb" or "because the tagger correctly tagged 'waffles' as a verb rather than a noun". If you think the tagger failed altogether, explain briefly why you think so -- e.g. "because the tagger tagged everything as a noun rather than, say, identifying 'waffles' as a verb"
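
The comparison behind items 4 and 5 can be sketched as follows: write down the tag sequence for each interpretation, then check which one the tagger's output matches. The tags below are coarse labels based on the Haber quote above, not the exact tags the online tagger displays. The names `which_reading` and `differing_words` are hypothetical, and the "tagger output" is invented purely for illustration.

```python
# Tag sequences for the two readings, following the Haber quote.
SERIOUS = [("British", "Adj"), ("left", "N"), ("waffles", "V"),
           ("on", "Prep"), ("Falklands", "N")]
SILLY = [("British", "N"), ("left", "V"), ("waffles", "N"),
         ("on", "Prep"), ("Falklands", "N")]

def which_reading(tagged, serious, silly):
    """Classify a tagged headline as 'serious', 'silly', or 'fail'."""
    if tagged == serious:
        return "serious"
    if tagged == silly:
        return "silly"
    return "fail"

def differing_words(reading_a, reading_b):
    """Words whose tags differ between two readings of the same headline."""
    return [(word, tag_a, tag_b)
            for (word, tag_a), (_, tag_b) in zip(reading_a, reading_b)
            if tag_a != tag_b]

# Invented output in which every word is tagged as a noun.
made_up_output = [(word, "N") for word, _ in SERIOUS]

verdict = which_reading(made_up_output, SERIOUS, SILLY)
contrast = differing_words(SERIOUS, SILLY)
print(verdict)
print(contrast)
```

The words returned by `differing_words` are the ones worth hovering over in the tagger's output, since they are the only places where the two interpretations can be told apart.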

(I know it's annoying to have to explain a joke, but I hope you at least had as much fun reading through the list of headlines as I had putting it together.)

<!-- BEGIN QUESTION -->

**Question 3.2 part 1 (10 points)** Describe your first headline here as specified above.

1. Headline:

2. Serious paraphrase:

3. Silly paraphrase:

4. Serious, silly, or fail:

5. Reason ("because X rather than Y"): 

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.2 part 2 (10 points)** Describe your second headline here as specified above.

1. Headline:

2. Serious paraphrase:

3. Silly paraphrase:

4. Serious, silly, or fail:

5. Reason ("because X rather than Y"): 

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

**Question 3.2 part 3 (10 points)** Describe your third headline here as specified above.

1. Headline:

2. Serious paraphrase:

3. Silly paraphrase:

4. Serious, silly, or fail:

5. Reason ("because X rather than Y"): 