**Juhwan Lee**

**CS410: Natural Language Processing**

**Assignment 3: Parts-of-Speech Tagging**

**1. (1pt) Search the web for 2 “spoof newspaper headlines”, to find such gems as: British Left Waffles on Falkland Islands, and Juvenile Court to Try Shooting Defendant. Manually tag these headlines to see if knowledge of the part-of-speech tags removes the ambiguity.**

In [None]:
import nltk

# Tags follow the universal tagset; 'on' is a preposition (ADP).
headline = 'British/NOUN Left/VERB Waffles/NOUN on/ADP Falkland/NOUN Islands/NOUN'
[nltk.tag.str2tuple(t) for t in headline.split()]

In [None]:
headline = 'Juvenile/NOUN Court/NOUN to/PRT Try/VERB Shooting/ADJ Defendant/NOUN'
[nltk.tag.str2tuple(t) for t in headline.split()]


**2. (1pt) Tokenize and tag the following sentence: They wind back the clock, while we chase after the wind. What is the output?**

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = 'They wind back the clock, while we chase after the wind.'
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
tagged

**3. (1pt) Pick 2 words that can be either a noun or a verb (e.g., contest). Predict which POS tag is likely to be the most frequent in the Brown corpus, and compare with your predictions.**

fall, dance

Prediction: 'fall' is most frequent as a verb, and 'dance' is most frequent as a noun.

In [None]:
from nltk.corpus import brown
nltk.download('brown')
nltk.download('universal_tagset')

tagged_words = brown.tagged_words(tagset='universal')
cfd = nltk.ConditionalFreqDist(tagged_words)

In [None]:
cfd['fall'].most_common()

In [None]:
cfd['dance'].most_common()

Result: both 'fall' and 'dance' are most frequent as nouns, so the prediction held for 'dance' but not for 'fall'.
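The lookups above are case-sensitive, so capitalized forms such as 'Fall' or 'Dance' are counted under separate keys. A minimal sketch of a case-folded count follows. `tag_counts` is a hypothetical helper, and `sample` is made-up data standing in for `brown.tagged_words(tagset='universal')`.

```python
from collections import Counter

def tag_counts(tagged_words, target):
    """Count the tags assigned to `target`, ignoring case (hypothetical helper)."""
    target = target.lower()
    return Counter(tag for (word, tag) in tagged_words if word.lower() == target)

# Made-up sample standing in for brown.tagged_words(tagset='universal').
sample = [('Fall', 'NOUN'), ('leaves', 'NOUN'), ('fall', 'VERB'),
          ('the', 'DET'), ('fall', 'NOUN'), ('dance', 'VERB')]

fall_counts = tag_counts(sample, 'fall')
print(fall_counts.most_common())
```

With the real corpus, the same function could be called as `tag_counts(brown.tagged_words(tagset='universal'), 'fall')`.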

**4. (2pt) Use sorted() and set() to get a sorted list of tags used in the Brown corpus, removing duplicates.**

In [None]:
tagged_words = brown.tagged_words()
# Collect the tags, drop duplicates with set(), then sort
sorted(set(tag for (_, tag) in tagged_words))

**5. (4pt) Write programs to process the Brown Corpus and find answers to the following questions: (i) Which nouns are more common in their plural form, rather than their singular form? (Only consider regular plurals, formed with the -s suffix.) (ii) List the top 20 tags in order of decreasing frequency - what do these most frequent tags represent?**

In [None]:
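tagged_words = brown.tagged_words()
cfd = nltk.ConditionalFreqDist(tagged_words)

# Snapshot the word types first: indexing cfd with a missing key creates an empty entry
word_types = list(cfd.conditions())
result = []
for word in word_types:
    plural = word + 's'
    # Keep nouns whose regular plural (NNS) outnumbers the singular (NN)
    if plural in cfd and cfd[plural]['NNS'] > cfd[word]['NN']:
        result.append((word, cfd[plural]['NNS'], cfd[word]['NN']))

# Show the first 20 matches as (singular, plural count, singular count)
result[:20]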

In [None]:
tag_list = [t for (_, t) in tagged_words]
fd = nltk.FreqDist(tag_list)
fd.most_common(20)

**6. (5pt) Generate some statistics for tagged data to answer the following questions: (i) What proportion of word types are always assigned the same part-of-speech tag? (ii) How many word types are ambiguous, in the sense that they appear with at least two tags? (iii) What percentage of word tokens in the Brown Corpus involve these ambiguous word types?**

In [None]:
tagged_words = brown.tagged_words(tagset='universal')
cfd = nltk.ConditionalFreqDist(tagged_words)

In [None]:
proportion = sum(1 for word in cfd if len(cfd[word]) == 1) / len(cfd)
proportion

In [None]:
ambiguous = sum(1 for word in cfd if len(cfd[word]) > 1)
ambiguous

In [None]:
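# Percentage of word tokens (not word types) that belong to ambiguous word types
amb_tokens = sum(cfd[word].N() for word in cfd if len(cfd[word]) > 1)
amb_token_pct = 100 * amb_tokens / cfd.N()
amb_token_pct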
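Parts (ii) and (iii) differ in what is counted: (ii) counts distinct word types, while (iii) weights each ambiguous type by how often it occurs. A small sketch on made-up data makes the distinction concrete. `ambiguity_stats` and `sample` are hypothetical and do not come from the corpus.

```python
from collections import Counter, defaultdict

def ambiguity_stats(tagged_words):
    """Return (ambiguous type count, share of types, share of tokens)."""
    tags_by_word = defaultdict(Counter)
    for word, tag in tagged_words:
        tags_by_word[word][tag] += 1

    # A word type is ambiguous if it was seen with more than one tag
    ambiguous = [w for w, tags in tags_by_word.items() if len(tags) > 1]

    total_tokens = sum(sum(tags.values()) for tags in tags_by_word.values())
    ambiguous_tokens = sum(sum(tags_by_word[w].values()) for w in ambiguous)

    return (len(ambiguous),
            len(ambiguous) / len(tags_by_word),
            ambiguous_tokens / total_tokens)

# Made-up sample: 'wind' is the only ambiguous type but accounts for half the tokens.
sample = [('the', 'DET'), ('wind', 'NOUN'), ('wind', 'VERB'),
          ('wind', 'NOUN'), ('clock', 'NOUN'), ('the', 'DET')]

n_types, type_share, token_share = ambiguity_stats(sample)
print(n_types, type_share, token_share)
```

In this toy sample one of three types is ambiguous, yet it covers three of six tokens, which is why the type-level and token-level figures should be reported separately.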

**7. (6pt) Write code to search the Brown Corpus for particular words and phrases according to tags, to answer the following questions: (i) Produce an alphabetically sorted list of the distinct words tagged as MD. (ii) Identify words that can be plural nouns or third person singular verbs (e.g. deals, flies). (iii) What is the ratio of masculine to feminine pronouns?**

In [None]:
words = brown.words()
tagged_words = brown.tagged_words()
set_words = set(words)
cfd = nltk.ConditionalFreqDist(tagged_words)
conditions = cfd.conditions()

md_words = [condition for condition in conditions if cfd[condition]['MD'] != 0]
md_words.sort()
md_words

In [None]:
two_words = [condition for condition in conditions if cfd[condition]['NNS'] and cfd[condition]['VBZ']]
two_words.sort()
two_words

In [None]:
fd = nltk.FreqDist(words)
masc_fem_proportion = (fd['he'] + fd['He']) / (fd['she'] + fd['She'])
masc_fem_proportion
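The ratio above counts only the subject forms 'he' and 'she'. A broader sketch could also include object, possessive, and reflexive forms. The two pronoun sets below are an assumption of this sketch, not part of the assignment, and `pronoun_ratio` and `sample` are hypothetical. Every occurrence of 'his' and 'her' is counted, whether it acts as a pronoun or as a possessive determiner.

```python
# Assumed pronoun inventories (an illustrative choice, not from the assignment).
MASCULINE = {'he', 'him', 'his', 'himself'}
FEMININE = {'she', 'her', 'hers', 'herself'}

def pronoun_ratio(words):
    """Return the masculine-to-feminine pronoun ratio, ignoring case."""
    masc = fem = 0
    for word in words:
        lowered = word.lower()
        if lowered in MASCULINE:
            masc += 1
        elif lowered in FEMININE:
            fem += 1
    if fem == 0:
        raise ValueError('no feminine pronouns found; ratio is undefined')
    return masc / fem

# Made-up sample standing in for brown.words().
sample = ['He', 'gave', 'her', 'his', 'book', 'and', 'she', 'thanked', 'him', '.']

ratio = pronoun_ratio(sample)
print(ratio)
```

With the real corpus, the call would be `pronoun_ratio(brown.words())`.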

**8. (6pt) How serious is the sparse data problem? Investigate the performance of n-gram taggers as n increases from 1 to 6. Tabulate the accuracy score.**

In [None]:
tagged_sents = brown.tagged_sents()
size = int(len(tagged_sents) * 0.9)
train_sents = tagged_sents[:size]
test_sents = tagged_sents[size:]

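# Train an n-gram tagger (no backoff) for n = 1..6 and tabulate its accuracy.
# Newer NLTK releases deprecate evaluate() in favor of accuracy().
print('n\taccuracy')
for n in range(1, 7):
    ngram_tagger = nltk.NgramTagger(n, train_sents)
    print(f'{n}\t{ngram_tagger.evaluate(test_sents):.4f}')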
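One way to see why accuracy can fall as n grows is to measure how often a test context was never observed in training, since an n-gram tagger without backoff has no entry for such a context. The sketch below is a rough illustration on made-up data. `contexts` and `unseen_rate` are hypothetical helpers, and the context definition (the previous n-1 tags plus the current word) is an assumption modeled on how an n-gram tagger conditions its choice. It uses gold tags as history, whereas a real tagger uses its own predictions.

```python
def contexts(tagged_sent, n):
    """Yield (previous n-1 gold tags, current word) for each token in a sentence."""
    tags = [tag for (_, tag) in tagged_sent]
    for i, (word, _) in enumerate(tagged_sent):
        yield (tuple(tags[max(0, i - n + 1):i]), word)

def unseen_rate(train_sents, test_sents, n):
    """Fraction of test contexts that never occur in the training sentences."""
    seen = {c for sent in train_sents for c in contexts(sent, n)}
    test_contexts = [c for sent in test_sents for c in contexts(sent, n)]
    return sum(1 for c in test_contexts if c not in seen) / len(test_contexts)

# Made-up toy data, not taken from the Brown Corpus.
toy_train = [
    [('the', 'DET'), ('dog', 'NOUN'), ('barks', 'VERB')],
    [('a', 'DET'), ('big', 'ADJ'), ('cat', 'NOUN'), ('sleeps', 'VERB')],
]
toy_test = [
    [('the', 'DET'), ('big', 'ADJ'), ('dog', 'NOUN'), ('barks', 'VERB')],
]

rates = {n: unseen_rate(toy_train, toy_test, n) for n in range(1, 4)}
print(rates)
```

Every test word appears in the toy training data, so the unigram rate is zero, but longer tag histories are matched less and less often. The same measurement could be run on `train_sents` and `test_sents` from the cell above.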

**9. (6pt) There are 264 distinct words in the Brown Corpus having exactly three possible tags. (i) Print a table with the integers 1..10 in one column, and the number of distinct words in the corpus having 1..10 distinct tags in the other column. (ii) For the word with the greatest number of distinct tags, print out sentences from the corpus containing the word, one for each possible tag.**

In [None]:
from tabulate import tabulate

tagged_words = brown.tagged_words()
cfd = nltk.ConditionalFreqDist(tagged_words)

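# (word, number of distinct tags) for every word type in the corpus
num_tags = [(word, len(cfd[word])) for word in cfd.conditions()]

# Count how many word types have exactly i distinct tags, for i = 1..10
tags_by_num = []
for i in range(1, 11):
    this_num = sum(1 for (_, num) in num_tags if num == i)
    tags_by_num.append((i, this_num))

print(tabulate(tags_by_num, headers=['Distinct tags', 'Word types']))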

In [None]:
most_distinct = ""
num_of_tags = 0

for (word, num) in num_tags:
    if num > num_of_tags:
        num_of_tags = num

for (word, num) in num_tags:
    if num == num_of_tags:
        most_distinct = word

most_distinct

In [None]:
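# 'that' is hard-coded here; it is assumed to match most_distinct from the previous cell
remaining_tags = set(cfd['that'])
tagged_sents = brown.tagged_sents()

for sent in tagged_sents:
    for (word, tag) in sent:
        # Print one example sentence for each tag not yet illustrated
        if word in ('that', 'That') and tag in remaining_tags:
            print(sent)
            print("**********")
            remaining_tags.remove(tag)
    # Stop scanning once every tag has an example
    if not remaining_tags:
        break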
