# LING 242 Exercise 1: Strings

This exercise will
- give you some more practice with the basic string methods available in Python
- have you iterate over NLTK corpora

## Getting started

This exercise requires that you have downloaded the following NLTK corpora/lexicons:

In [1]:
import nltk
nltk.download("treebank")
nltk.download("brown")
nltk.download("reuters")

[nltk_data] Downloading package treebank to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package brown to /Users/jungyeul/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package reuters to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package reuters is already up-to-date!


True

Run the code below so you can access them:

In [2]:
from nltk.corpus import treebank,brown,reuters,stopwords
import matplotlib.pyplot as plt
from wordcloud import WordCloud

### Exercise 1: POS-tagged sentence to string

Part-of-speech (POS) tagged sentences from the Brown corpus, which you can access by iterating over the output of the <i>tagged_sents</i> method, look like this:

\[('The', 'AT'), ('Hartsfield', 'NP'), ('home', 'NR'), ('is', 'BEZ'), ('at', 'IN'), ('637', 'CD'), ('E.', 'NP'), ('Pelham', 'NP'), ('Rd.', 'NN-TL'), ('Aj', 'NN'), ('.', '.')\]

That is, they are lists of (word, part-of-speech) tuples, where both the word and the POS tag are strings. You'll notice that the POS tags for words like "Hartsfield" and "home" start with "N", indicating that these are nouns; see [the Wikipedia entry for the Brown corpus](https://en.wikipedia.org/wiki/Brown_Corpus#Part-of-speech_tags_used) for more information about the meanings of these tags.

In this exercise you will convert a sentence of the Brown corpus to a single string (appropriate for writing to disk) that looks like this:

the/AT hartsfield/NP home/NR is/BEZ at/IN 637/CD e./NP pelham/NP rd./NN aj/NN ./.

Note the following changes:
- the sentence has been converted into a single string
- the words are now lowercase
- any POS tag with a hyphen consists only of the part before the hyphen (the marking after the hyphen contains information that is not included in most POS tagsets)
- each word and part-of-speech tag is separated by a forward slash (/)
- each word/POS pair is separated by a space

Do this for the first sentence of the Brown corpus, which has already been extracted for you. Assign the result to the variable `string_sent` and check you have done this correctly using the provided asserts.
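
The conversion rules above can be sketched on the example sentence from the start of this exercise. This is only one possible approach, not the reference solution: the names `tagged_example`, `pairs`, and `example_string` are made up for illustration, and the sentence is typed in by hand rather than loaded from NLTK.

```python
# Example sentence copied from the exercise text (not loaded from NLTK here).
tagged_example = [
    ("The", "AT"), ("Hartsfield", "NP"), ("home", "NR"), ("is", "BEZ"),
    ("at", "IN"), ("637", "CD"), ("E.", "NP"), ("Pelham", "NP"),
    ("Rd.", "NN-TL"), ("Aj", "NN"), (".", "."),
]

pairs = []
for word, tag in tagged_example:
    # Keep only the part of the tag before the first hyphen.
    base_tag = tag.split("-")[0]
    # Lowercase the word and glue word and tag together with a slash.
    pairs.append(word.lower() + "/" + base_tag)

# Join the word/tag pairs into a single space-separated string.
example_string = " ".join(pairs)
print(example_string)
```

Note that a tag which itself starts with a hyphen would come out empty with this split, so check for that case if your sentence contains one.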

In [5]:
tagged_sent = brown.tagged_sents()[0]
string_sent = []

# print(string_sent)

In [None]:
assert string_sent == "the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS took/VBD place/NN ./."
print("Success!")

### Exercise 2: Exploring the Reuters corpus

NLTK provides access to the `reuters` corpus, which is a collection of Reuters news articles labelled by topic. Use the corpus to find each of the following:

* The size of the corpus, in word tokens

In [None]:
from nltk.corpus import reuters, stopwords
from collections import Counter

# reuters.words() yields every word token in the corpus
len(reuters.words())

* The number of word types, after lowercasing:

In [None]:
word_list = []
for word in reuters.words():
    word_list.append(word.lower())

len(set(word_list))
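
To make the difference between tokens and types concrete, here is a small sketch on a made-up token list. The list `tokens` is a hypothetical stand-in for `reuters.words()`, so the numbers say nothing about the real corpus.

```python
# Hypothetical toy token list standing in for reuters.words().
tokens = ["The", "price", "of", "gold", "rose", "as",
          "the", "price", "of", "oil", "fell", "."]

# Tokens: every occurrence counts.
num_tokens = len(tokens)

# Types: distinct strings, still case-sensitive ("The" and "the" differ).
num_types = len(set(tokens))

# Types after lowercasing: "The" and "the" collapse into one type.
num_types_lower = len({token.lower() for token in tokens})

print(num_tokens, num_types, num_types_lower)  # 12 10 9
```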

* The topics associated with the document labelled 'test/14840':

* The number of documents which are labelled with the "gold" topic

In [None]:
i = 0
for f in reuters.fileids():
    if 'gold' in reuters.categories(f):
        i += 1
print(i)

* The total number of sentences in documents which have the "tea" topic

In [None]:
num = 0
for f in reuters.fileids('tea'):
    num += len(reuters.sents(f))

print(num)

Finally, create a word cloud for the "coffee" topic, excluding stopwords (this will take more than one line).

In [None]:
stopwords_set = stopwords.words("english")  # the fileid is lowercase; requires nltk.download("stopwords")

# wordcloud = WordCloud(stopwords=stopwords_set).generate(" ".join(reuters.words(categories="coffee")))

### Exercise 3: Building word lists

In this exercise, you'll be creating lists of words from the Brown corpus which have some particular property. You should iterate over each word in the corpus (using a <i>for</i> loop and the <i>words</i> method), and add it to the list (using append) if it is the kind of word you're looking for. Then you should print out the length of the list as well as the first five examples you found (remember that slicing also works for lists!). It is okay if you repeat the code for iterating over the Brown corpus in each cell (it is also okay if you decompose this and use a function!).
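
The loop-and-append pattern described above could be decomposed into a function along these lines. This is a sketch on a made-up word list: `collect_words`, `contains_x`, and `toy_words` are hypothetical names, and in the exercise you would pass `brown.words()` instead of `toy_words`.

```python
def collect_words(words, keep):
    """Return the words for which keep(word) is True, in their original order."""
    matches = []
    for word in words:
        if keep(word):
            matches.append(word)
    return matches


def contains_x(word):
    """True if the word contains a lowercase x."""
    return "x" in word


# Hypothetical toy list standing in for brown.words().
toy_words = ["The", "index", "of", "the", "Box", "was", "fixed", "index"]

x_words = collect_words(toy_words, contains_x)
print(len(x_words))   # repeated words are counted each time they occur
print(x_words[:5])    # slicing is safe even if fewer than five words matched
```

Because the function collects every matching occurrence, a word that appears twice in the input appears twice in the result.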

#### 3.1

Find words which contain the lowercase letter x

In [None]:
word_list = []

len(word_list)

In [None]:
assert len(word_list) == 9365
assert "index" in word_list
assert "indices" not in word_list
print("Success!")

#### 3.2

Find capitalized words (words that start with an uppercase letter)
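
When testing for capitalization, it matters whether `isupper()` is called on the whole word or only on its first character. The sketch below shows the difference on a made-up list (`toy_words` and `capitalized` are hypothetical names); it assumes every word is non-empty, so that `word[0]` is safe.

```python
# Hypothetical toy list; every entry is a non-empty string.
toy_words = ["The", "NLTK", "corpus", "1961", "Rd.", "e.g."]

for word in toy_words:
    # word[0].isupper() looks only at the first character;
    # word.isupper() is True only if all cased characters are uppercase.
    print(word, word[0].isupper(), word.isupper())

# Words whose first character is an uppercase letter.
capitalized = [word for word in toy_words if word[0].isupper()]
print(capitalized)
```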

In [None]:
word_list = []
#your code here
# ...isupper():

len(word_list)

In [None]:
assert len(word_list) == 125369
assert "The" in word_list
assert "the" not in word_list
print("Success!")

#### 3.3
rubric={accuracy:2}

Find present participles (words of at least length 6 that end with "ing")

In [None]:
word_list = []
#your code here
# word[-3:] == 'ing' ...
# endswith('ing') ...

len(word_list)

In [None]:
assert len(word_list) == 25038
assert "waiting" in word_list
assert "wait" not in word_list
print("Success!")

#### 3.4
rubric={accuracy:2}

Find entirely alphabetic words which contain no vowel. Your solution should be case insensitive!
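
One way to combine the two conditions (entirely alphabetic, no vowel in either case) is sketched below on a made-up list. The helper `has_no_vowel` and the list `toy_words` are hypothetical, and the vowel set is assumed to include "y", as in the cell that follows.

```python
vowels = set("aeiouy")  # assumed to match the vowel set used in this exercise


def has_no_vowel(word):
    """True if word is purely alphabetic and none of its letters is a vowel."""
    if not word.isalpha():
        return False
    # Lowercase once so that uppercase vowels are caught too.
    for letter in word.lower():
        if letter in vowels:
            return False
    return True


# Hypothetical toy list standing in for brown.words().
toy_words = ["CD", "CAD", "Mr", "Mr.", "nth", "by", "TV", "1st"]

no_vowel_words = [word for word in toy_words if has_no_vowel(word)]
print(no_vowel_words)
```

Here "Mr." and "1st" are rejected by `isalpha()`, while "CAD" and "by" are rejected because they contain a vowel.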

In [None]:
vowels = set(["a","e","i","o","u","y"])
word_list = []
#your code here

print(len(word_list))
print(word_list[:5])

In [None]:
assert len(word_list) == 1098
assert "CD" in word_list
assert "CAD" not in word_list
print("Success!")

#### 3.5

Find all words which contain a single hyphen and in which each part of the hyphenated word is at least four letters long.
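
A possible check for this condition, sketched on a made-up list: splitting on the hyphen gives exactly two parts when the word contains exactly one hyphen. The names `is_long_hyphenated` and `toy_words` are hypothetical, and the sketch only checks the length of each part, not whether the characters are letters.

```python
def is_long_hyphenated(word, min_len=4):
    """True if word has exactly one hyphen and both parts have at least min_len characters."""
    parts = word.split("-")
    # Exactly one hyphen means the split yields exactly two parts.
    if len(parts) != 2:
        return False
    return len(parts[0]) >= min_len and len(parts[1]) >= min_len


# Hypothetical toy list standing in for brown.words().
toy_words = ["hard-fought", "hard-won", "well-to-do", "--", "self-", "long-term"]

hyphenated = [word for word in toy_words if is_long_hyphenated(word)]
print(hyphenated)
```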

In [None]:
word_list = []
#your code here
#  word.find('-') == word.rfind('-')  ## rfind() returns the index of the last occurrence, so equality means at most one hyphen
#  len(word.split("-")) == 2          ## exactly one hyphen gives exactly two parts

len(word_list)

In [None]:
assert len(word_list) == 4509
assert "hard-fought" in word_list
assert "hard-won" not in word_list
print("Success!")

### Exercise 4: Add one 

Numbers that you run across in text corpora will actually be strings, not integers or floats. If you want to use them as numbers for some purpose, you'll need to convert them. Iterate over the sentences in the Penn Treebank corpus (using the <i>sents</i> method) and find integers written in Arabic numerals (your code doesn't need to handle numbers written in English, e.g. "three", though if you want to add that functionality, great!). When you find a number in a sentence, print the sentence, add one to the number, then print out the sentence again with the new number (HINT: you will need two loops, one which iterates over the sentences in the corpus and one which iterates over the words in each sentence; however, in order to modify the sentence as we have asked you to, you'll want to iterate over the words in the sentence using an index). It's okay if you print out the sentence multiple times if it contains multiple numbers.

Both the original and the modified sentence should be lists of strings; to confirm this, please <i>join</i> them (with spaces) before you print them. There will be some weird cases of numbers where you don't expect them: any idea what's going on there?

In [None]:
sent = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

In [None]:
for word in sent:
    print(word)

In [None]:
i = 0
for word in sent:
    print(i, word)
    i += 1

In [None]:
for i in range(len(sent)):
    word = sent[i]
    print(i, word)

In [None]:
for i, word in enumerate(sent):
    print(i, word)
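
Building on the `enumerate` loop, the add-one step for a single sentence might look like the sketch below. It is not the full solution: it works on the sample sentence only rather than looping over the corpus, the list `outputs` is a hypothetical helper for inspecting the results, and `isdigit()` is assumed to be good enough here (it will not match numbers such as "1,000" or "3.5").

```python
sent = ['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join',
        'the', 'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']

outputs = []
for i, word in enumerate(sent):
    if word.isdigit():
        # Join before modifying so the original sentence can be printed.
        before = " ".join(sent)
        # Convert to int, add one, and store the result back as a string.
        sent[i] = str(int(word) + 1)
        after = " ".join(sent)
        print(before)
        print(after)
        outputs.append((before, after))
```

Assigning to `sent[i]` is what requires the index: rebinding the loop variable `word` would leave the list unchanged.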