# Week 1 assignment
The assignment is to fill in the blanks and get the text-cleaning function to work.  Here is an example of a function that takes in a string and returns the lowercased version:

In [1]:
def make_lowercase(a_string):
    """
    Takes in a string and returns the lowercased version of it.
    """
    lowercased_string = a_string.lower()
    
    return lowercased_string

In [2]:
test_string = 'Here is a String with SOME Capitals.'
lowered_string = make_lowercase(test_string)
print(lowered_string)

here is a string with some capitals.


The `return` statement is what actually gives you back the lowercased version of the string.  If you leave out the `return`, the function will still run, but it will give you back `None` instead of the lowercased string.
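To make the point about `return` concrete, here is a small sketch. The function name `lowercase_without_return` is made up for this illustration and is not part of the assignment; it does the lowercasing but never returns the result.

```python
def lowercase_without_return(a_string):
    """Lowercases the string but forgets to return it."""
    lowercased_string = a_string.lower()  # computed, then thrown away


result = lowercase_without_return('SOME Capitals')
print(result)  # the function has no return statement, so result is None
```

Adding `return lowercased_string` as the last line of the function would make `result` hold the lowercased text instead.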

# Lists
One of the fundamental data types in Python is the list.  A list is a sequence of items (numbers, strings, etc.) written within square brackets:

In [3]:
test_list = [1, 3, 5, 8]
print(test_list)

[1, 3, 5, 8]


Lists have a number of built-in methods: https://docs.python.org/3/tutorial/datastructures.html
Here we will only use the `.append()` method, which adds an item to the end of the list:

In [4]:
test_list.append(100)
print(test_list)

[1, 3, 5, 8, 100]


# List comprehensions
There are some list comprehensions in the code below.  This is a Python trick which can condense your code.  For example:

In [5]:
# this is a for loop that makes a list of the numbers 2, 4, 6, squared (2^2, etc)
list_of_numbers = []
for i in [2, 4, 6]:
    list_of_numbers.append(i ** 2)
    
# putting a variable at the end of a cell will print out the variable
list_of_numbers

[4, 16, 36]

In [6]:
# we can do the same thing like this, which is called a list comprehension
[i ** 2 for i in [2, 4, 6]]

[4, 16, 36]
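A list comprehension can also filter items with an `if` at the end, which is the form used by the commented-out stopword filter inside `clean_text`. The sketch below uses made-up numbers to show the loop and the comprehension side by side.

```python
numbers = [1, 2, 3, 4, 5, 6]

# for-loop version: keep only the even numbers
evens_loop = []
for n in numbers:
    if n % 2 == 0:
        evens_loop.append(n)

# list comprehension version, with the condition at the end
evens_comprehension = [n for n in numbers if n % 2 == 0]

print(evens_loop)
print(evens_comprehension)
```

Both versions build the same list; the comprehension just says it in one line.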

# Assignment
Now for the assignment.  Fill in all the blanks (three underscores, \_\_\_) to get the code working.  You can use the test at the bottom of the file to make sure you've got it working properly.

In [7]:
# built-in Python libraries
import string

# you probably need to download stopwords first:
# https://stackoverflow.com/a/41640852/4549682
from nltk.corpus import stopwords
en_stopwords = stopwords.words('english')  # might have to use 'en' instead of 'english'
stopwords = set(en_stopwords)

from nltk.stem.porter import PorterStemmer

In [8]:
#import functions from NLTK library

from nltk.tokenize import word_tokenize  #split text into tokens by words and punctuation
from nltk.probability import FreqDist    #frequency count for tokens 
from nltk.util import ngrams             #get n-connected tokens
from nltk.stem import WordNetLemmatizer  #lemmatize words to dictionary form

In [9]:
#the lemmatize argument is False by default (words are stemmed); pass True to lemmatize words instead
#parts of speech arguments that can be passed:
#ADJ, ADJ_SAT, ADV, NOUN, VERB = 'a', 's', 'r', 'n', 'v'
#parts of speech tag references sourced from: 
#https://stackoverflow.com/questions/15586721/wordnet-lemmatization-and-pos-tagging-in-python

def clean_text(document, lemmatize=False, pos=None): 
    """
    cleans text for analysis
    
    - lowercases
    - removes punctuation and numbers
    - lemmatizes or stems
    - removes stopwords
    """
    # lowercase text
    print('lowercasing')
    # use the .lower method of strings to lowercase the document
    # https://docs.python.org/3/library/stdtypes.html#str.lower
    lowercased_doc = document.lower()
    
    print('removing punctuation/numbers')
    # remove punctuation and numbers using the "String constants" from the string library:
    # https://docs.python.org/3/library/string.html#string-constants
    # do this before stemming, so things like "act's" turn into 'act' instead of 'act s'
    table = str.maketrans({key: None for key in string.punctuation + string.digits + "‘" + "’"})
    
    # use the 'translate' method on each of the docs to remove punctuation
    # here is an example: https://stackoverflow.com/a/34294398/4549682
    clean_document = lowercased_doc.translate(table)
    
    # stem words -- basically chop off the ends
    
    stemmer = PorterStemmer()
    wnl = WordNetLemmatizer()
    #printed_status=False
    
    stems = []
    
    # use the split method of strings to split the document (at spaces) into words:
    # https://docs.python.org/3/library/stdtypes.html#str.split
    # this will also remove extra spaces at the ends and beginnings of words
    words = clean_document.split()
    
    # print whether lemmatization or stemming will be done,
    # based on the argument passed to clean_text
    if lemmatize:
        print("lemmatizing")
    else:
        print("stemming")

    for w in words:
        if lemmatize:
            if pos is not None:
                # lemmatize using the part of speech passed as a function argument
                stems.append(wnl.lemmatize(w, pos=pos))
            else:
                # default part of speech is noun
                stems.append(wnl.lemmatize(w))
        else:
            # stem the word with the stemmer, and add to the 'stems' list:
            # http://www.nltk.org/howto/stem.html
            stems.append(stemmer.stem(w))

    clean_document = stems
    
    # remove stopwords
    print('removing stopwords')
    # this is the list comprehension way to do it
    # clean_document = [w for w in clean_document if w not in stopwords]
    
    clean_words = []
    for w in clean_document:
        # make sure the word is not in the stopword set
        # remember -- we created the 'stopwords' variable in the cell above
        if w not in stopwords:
            clean_words.append(w)
    
    """
    join tokens back into a single string with the .join() method of strings:
    https://www.tutorialspoint.com/python/string_join.htm
    https://docs.python.org/3/library/stdtypes.html#str.join
    You will want to join the strings with a single space.
    """
    clean_document = " ".join(clean_words)
    
    # return the clean document
    return clean_document

In [10]:
test_string = 'this test string has    CAPS, numbers 9-8, punctuation !*&$, and stopwords like "the"'

In [11]:
correct_output = 'thi test string ha cap number punctuat stopword like'
your_result = clean_text(test_string)
print('\n'*3)

if your_result == correct_output:
    print('Hooray! You got it.')
else:
    print('Whoops, something is wrong.  Your function returned:')
    print(your_result)
    print('but it should look like:')
    print(correct_output)

lowercasing
removing punctuation/numbers
stemming
removing stopwords




Hooray! You got it.


# Now for the rest of the assignment, test it out on a Gutenberg book, a news article, or some other text.  Then get the top-frequency words with sklearn's CountVectorizer, nltk's FreqDist, or Counter.
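Of the three options named above, `Counter` from the standard library needs no extra installation. The sketch below is a minimal illustration, assuming the cleaned text is a single space-separated string like the one `clean_text` returns; the sample string is made up.

```python
from collections import Counter

# made-up stand-in for the string that clean_text returns
cleaned = 'king princess danc king soldier princess king'

# split on whitespace, then count each token
word_counts = Counter(cleaned.split())

# most_common(n) returns (token, count) pairs, highest count first
top_two = word_counts.most_common(2)
print(top_two)
```

`Counter.most_common` returns the same kind of `(token, count)` pairs as the `FreqDist` calls used later in this notebook.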

***

## Week 1 Assignment 

### Summary:

For this assignment I read in the text from a file containing an excerpt from [Grimms' Fairy Tales](http://www.gutenberg.org/ebooks/2591), the story "The Twelve Dancing Princesses" by Jacob Grimm and Wilhelm Grimm, and analyzed it with functions from the Natural Language Toolkit (NLTK) library for Python. The text was read in as a string and stored in a variable called `text`.

In the `clean_text` function above, I added a parameter `lemmatize` that defaults to `False`. After the text has been lowercased, stripped of punctuation and digits, and split into a list of words with `.split()`, this Boolean determines whether the words are lemmatized or stemmed. If `lemmatize=True` is passed to `clean_text`, then `lemmatizing` is printed and each word is reduced to its dictionary form for the part of speech given by `pos` (noun when `pos` is not given). If no value is given for `lemmatize`, the default `lemmatize=False` applies: `stemming` is printed before the `for w in words` loop starts, the lemmatizing branch inside the loop is skipped, and the `else` branch stems each word instead.

In addition to the `lemmatize` argument, I added a `pos` argument that specifies which part of speech the lemmatizer should assume. Its default value is `None`, so when no `pos` argument is given, `WordNetLemmatizer` falls back to its own default of noun (`"n"`).

Lastly, I added the two curly apostrophe characters (`‘` and `’`) to the set of characters removed by the `maketrans` table, so that they are also stripped from the string when cleaning the text.
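The character-removal step can be tried in isolation. The following is a small sketch using only the standard library; the sample sentence is made up, and the curly apostrophes are written as Unicode escapes so they are easy to see.

```python
import string

# characters to delete: ASCII punctuation, digits, and the two curly apostrophes
chars_to_remove = string.punctuation + string.digits + '\u2018\u2019'

# a dict that maps each character to None tells translate() to delete it
table = str.maketrans({ch: None for ch in chars_to_remove})

sample = 'the king\u2019s 12 daughters, it\u2019s said!'
cleaned = sample.translate(table)
print(cleaned)

# deleting "12" leaves two spaces in a row; split() collapses them
print(cleaned.split())
```

Because `string.punctuation` only covers ASCII characters, a curly apostrophe would survive the cleaning unless it is added to the table explicitly.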

In [12]:
#read in text from file as a string
#string value is set to variable "text"
with open("12dancingprincesses.txt") as f:
    text = f.read()

In [13]:
#verify the first 5 characters in the "text" string variable
text[:5]

'THE T'

#### Stemmed Tokens

In this section, I ran the `clean_text` function on the text read in from the `12dancingprincesses.txt` file. Because I did not pass `lemmatize=True`, the default `False` applied and the text was stemmed after punctuation and digits were removed. The cleaned text was stored in a variable `clean_result` and passed to the NLTK `word_tokenize` function. The frequency of each token was then calculated with `FreqDist`, and `most_common(10)` returned the 10 most frequently occurring tokens in the frequency distribution.

In [14]:
#run the clean_text function on the "text" variable
#default argument is to stem words (lemmatize=False)
#words in this text will be stemmed

#clean_result is a string
clean_result = clean_text(text)

lowercasing
removing punctuation/numbers
stemming
removing stopwords


In [15]:
#separate words and leftover punctuation into tokens
#this will generate a list
tknzwords = word_tokenize(clean_result)

In [16]:
#use list of tokens
#FreqDist will count the frequency of each token
#most_common(10) will return the top 10 most frequently occurring tokens
FreqDist(tknzwords).most_common(10)

[('wa', 24),
 ('princess', 22),
 ('soldier', 19),
 ('king', 17),
 ('said', 16),
 ('danc', 12),
 ('hi', 12),
 ('twelv', 11),
 ('went', 11),
 ('came', 10)]

#### Lemmatized Tokens

For this section, I used the `clean_text` function but, instead of the default `lemmatize=False`, passed in `True` so that the cleaned text (lowercased, with punctuation and digits removed) is lemmatized. The default part of speech (`pos`) for lemmatization is noun; a second example uses `pos="v"` to lemmatize words as verbs.

In [17]:
#use the clean_text function with True value for lemmatize argument
lem_clean = clean_text(text, lemmatize=True)

lowercasing
removing punctuation/numbers
lemmatizing
removing stopwords


In [18]:
#turn cleaned text into word tokens list
tknzlem = word_tokenize(lem_clean)

In [19]:
#top 10 most frequently occurring words in the lemmatized word tokens list
FreqDist(tknzlem).most_common(10)

[('wa', 24),
 ('princess', 22),
 ('soldier', 19),
 ('king', 17),
 ('said', 16),
 ('twelve', 11),
 ('went', 11),
 ('came', 10),
 ('eldest', 10),
 ('bed', 8)]

In [20]:
#use clean_text function to lemmatize using verb part-of-speech
vlem_clean = clean_text(text, lemmatize=True, pos="v")

lowercasing
removing punctuation/numbers
lemmatizing
removing stopwords


In [21]:
#word tokens list of verb-lemmatized text
tknzvlem = word_tokenize(vlem_clean)

In [22]:
#top 10 most frequently occurring words in the verb-lemmatized list
FreqDist(tknzvlem).most_common(10)

[('soldier', 19),
 ('princesses', 17),
 ('say', 17),
 ('go', 16),
 ('king', 13),
 ('dance', 12),
 ('twelve', 11),
 ('come', 11),
 ('eldest', 10),
 ('bed', 8)]

#### N-grams

Using the output from the Stemmed Tokens section, I used `for` loops to iterate over the token list and build lists of bigrams and trigrams, in order to find the most frequently occurring n-grams (for n = 2 and 3).

In [23]:
#empty list to hold bigram pair items
bgs = []

#add each bigram pair to bgs list
for bigram in ngrams(tknzwords, 2):
    bgs.append(bigram)

In [24]:
#top 10 most frequently occurring bigrams
FreqDist(bgs).most_common(10)

[(('princess', 'danc'), 4),
 (('went', 'bed'), 3),
 (('danc', 'night'), 3),
 (('king', 'son'), 3),
 (('came', 'wa'), 3),
 (('one', 'princess'), 3),
 (('third', 'night'), 3),
 (('said', 'wa'), 3),
 (('soldier', 'said'), 3),
 (('golden', 'cup'), 3)]

In [25]:
#empty list to hold trigrams
tgs = []

#add each trigram set to list
for trigram in ngrams(tknzwords, 3):
    tgs.append(trigram)

In [26]:
#top 10 most frequently occurring trigrams
FreqDist(tgs).most_common(10)

[(('king', 'son', 'soon'), 2),
 (('second', 'third', 'night'), 2),
 (('princess', 'danc', 'time'), 2),
 (('take', 'care', 'drink'), 2),
 (('grove', 'tree', 'leav'), 2),
 (('three', 'branch', 'golden'), 2),
 (('branch', 'golden', 'cup'), 2),
 (('twelv', 'danc', 'princess'), 1),
 (('danc', 'princess', 'wa'), 1),
 (('princess', 'wa', 'king'), 1)]
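To show what the n-gram step is doing, the same kind of bigram and trigram lists can be sketched with plain Python. This is an illustration using `zip` and `Counter` rather than NLTK's `ngrams` and `FreqDist`, and the short token list is a made-up stand-in for `tknzwords`.

```python
from collections import Counter

# made-up token list standing in for tknzwords
tokens = ['princess', 'danc', 'night', 'princess', 'danc', 'time']

# pair each token with the one after it (bigrams),
# and with the two after it (trigrams)
bigrams = list(zip(tokens, tokens[1:]))
trigrams = list(zip(tokens, tokens[1:], tokens[2:]))

top_bigram = Counter(bigrams).most_common(1)
print(top_bigram)
print(len(trigrams))
```

In the notebook itself, `list(ngrams(tknzwords, 2))` should build the same list as the `for` loop in one line, since `ngrams` yields one tuple per n-gram.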

### Conclusion:

Overall, the `clean_text` function produced broadly similar output, with slight variations depending on the additional arguments passed into it. Words such as king, twelve, eldest, and princess (or their stemmed forms) appeared in most of the top-10 lists. Both the stemmed and the noun-lemmatized tokens returned "wa" as the most frequently occurring word. Since "wa" is not an English word, I searched the text manually, and the only similar word that occurs so often is "was"; I would need to research further how the steps inside `clean_text` reduced the word in that way.
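One likely explanation for "wa", offered here as an assumption that has not been checked against this run: the stemmer strips the trailing "s" from "was", and `clean_text` removes stopwords only after stemming, so the shortened token no longer matches the stopword "was". The same ordering would explain "hi" (from "his") in the stemmed top 10 and "thi" and "ha" in the expected test output. The sketch below shows the effect with a tiny made-up stopword set and a crude hypothetical helper, `strip_trailing_s`, standing in for NLTK's stemmer.

```python
# tiny made-up stopword set; the notebook uses NLTK's English list instead
toy_stopwords = {'was', 'his', 'the'}


def strip_trailing_s(word):
    """Crude stand-in for a stemmer: drop one trailing 's' (but not 'ss')."""
    if word.endswith('s') and not word.endswith('ss'):
        return word[:-1]
    return word


words = ['the', 'princess', 'was', 'his', 'daughter']

# order used in clean_text: stem first, then filter stopwords
stem_then_filter = [w for w in map(strip_trailing_s, words) if w not in toy_stopwords]

# alternative order: filter stopwords first, then stem
filter_then_stem = [strip_trailing_s(w) for w in words if w not in toy_stopwords]

print(stem_then_filter)
print(filter_then_stem)
```

With stemming first, "wa" and "hi" slip past the stopword filter; with filtering first, they never reach the stemmer. The lemmatized output shows the same "wa" token, which suggests the default noun lemmatization also trims "was" before the stopword check.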

In the stemmed list, the word "went" is among the top 10 most frequently occurring tokens. In the verb-lemmatized list, however, its dictionary form "go" made the top 10 with a higher count, because the lemmatized occurrences of "went" are added to the occurrences of "go" that already appear in the text.

Comparing the bigrams output to the list of trigrams, some of the seemingly related n-grams were: 

`('princess', 'danc') -> ('princess', 'danc', 'time')` <br>
`('king', 'son') -> ('king', 'son', 'soon')` <br>
`('third', 'night') -> ('second', 'third', 'night')`