# Assignment 3a

* Please submit your assignment (notebooks of parts 3a and 3b + Python modules) as **a single .zip file** using Canvas (Assignments --> Assignment 3). Please put the notebooks for Assignment 3a and 3b as well as the Python modules (files ending with .py) in one folder, which you call ASSIGNMENT_3_FIRSTNAME_LASTNAME. Please zip this folder and upload it as your submission.

* Please name your zip file with the following naming convention: ASSIGNMENT_3_FIRSTNAME_LASTNAME.zip

**IMPORTANT NOTE**:
* The students who follow the Bachelor version of this course, i.e., the course Introduction to Python for Humanities and Social Sciences (L_AABAALG075) as part of the minor Digital Humanities, do **NOT have to do Exercises 3 and 4 of Assignment 3b**
* The other students, who follow the Master version of Programming in Python for Text Analysis (L_AAMPLIN021), are required to **DO Exercises 3 and 4 of Assignment 3b**

If you have **questions** about this topic, please contact us **(cltl.python.course@gmail.com)**. Questions and answers will be collected on Piazza, so please check if your question has already been answered first.


In this block, we covered a lot of ground:

* Chapter 12 - Importing external modules 
* Chapter 13 - Working with Python scripts
* Chapter 14 - Reading and writing text files
* Chapter 15 - Off to analyzing text 


In this assignment, you will first complete a number of small exercises about each chapter to make sure you are familiar with the most important concepts. In the second part of the assignment, you will apply your newly acquired skills to write your very own text processing program (ASSIGNMENT-3b) :-). But don't worry, there will be instructions and hints along the way. 


**Can I use external modules other than the ones treated so far?**

For now, please try to avoid it. All the exercises can be solved with what we have covered in block I, II, and III. 


## Functions & scope

### Exercise 1:

Define a function called `split_sort_text` which takes one positional parameter called **text** (a string).

The function:
* splits the string on a space character, i.e., ' '  [not all whitespace!]
* returns all the unique words in alphabetical order as a list.

* Hint 1: There is a specific python container which does not allow for duplicates and simply removes them. Use this one. 
* Hint 2: There is a built-in function which sorts items in an iterable called 'sorted'. Look at the documentation to see how it is used. 
* Hint 3: Don't forget to write a docstring. Please make sure that the docstring explains what the input is, what the function does, and what the function returns. If you want, you can use [reStructuredText](http://docutils.sourceforge.net/rst.html), but this is not needed to receive full points.

In [13]:
# Defining the function split_sort_text with one positional parameter 'text' with the data type str 
def split_sort_text(text: str, /):
    """
    Takes a string of text, converts it to a list of unique words used in the text 
    and returns this list sorted alphabetically.
    
    :param text: a string of text 
    :return: a list of words sorted alphabetically
    """
    # using the split method to split the string into a list of words
    words = text.split(' ')
    
    # converting the list to a set to obtain the unique words
    unique_words = set(words)
    
    # converting the set back to get a list of unique words
    list_unique_words = list(unique_words) 
    
    # sort the list on alphabetical order using the built-in sorted function
    sorted_list_unique_words = sorted(list_unique_words) 
    
    return sorted_list_unique_words

# We use a sample text from Assignment 2 to test our function with
a_text = "In a far away kingdom, there was a river. This river was home to many golden swans."

# We call the function and assign the returned list to sorted_unique_words_a_text
sorted_unique_words_a_text = split_sort_text(a_text)

# We print the list
print(sorted_unique_words_a_text)

['In', 'This', 'a', 'away', 'far', 'golden', 'home', 'kingdom,', 'many', 'river', 'river.', 'swans.', 'there', 'to', 'was']
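The note "[not all whitespace!]" in the exercise matters because `str.split(' ')` and `str.split()` behave differently. A minimal sketch of the difference, using a made-up string (`sample` is not part of the assignment):

```python
# Hypothetical sample containing a double space and a newline
sample = "a  b\nc"

# Splitting on a single space keeps an empty string for the double space
# and does not split on the newline
on_space = sample.split(' ')

# Splitting without an argument splits on any run of whitespace
on_whitespace = sample.split()

print(on_space)       # ['a', '', 'b\nc']
print(on_whitespace)  # ['a', 'b', 'c']
```

This is also why tokens such as 'river' and 'river.' stay separate in the output above: only the space character is treated as a boundary, and punctuation is left attached to the words.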


## Working with external modules

### Exercise 2
NLTK offers a way of using WordNet in Python. Do some research (using google, because quite frankly, that's what we do very often) and see if you can find out how to import it. WordNet is a computational lexicon which organizes words according to their senses (collected in synsets). See if you can print all the **synset definitions** of the lemma **dog**.

Make sure you have run the following cell so that WordNet is installed:

In [14]:
# For this exercise I used the following source: https://www.nltk.org/howto/wordnet.html

import nltk
from nltk.corpus import wordnet as wn

# download NLTK material, including WordNet (this only needs to be done once)
nltk.download('book')
nltk.download('omw-1.4')

[nltk_data] Downloading collection 'book'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /Users/koenvanderpool/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package brown to
[nltk_data]    |     /Users/koenvanderpool/nltk_data...
[nltk_data]    |   Package brown is already up-to-date!
[nltk_data]    | Downloading package chat80 to
[nltk_data]    |     /Users/koenvanderpool/nltk_data...
[nltk_data]    |   Package chat80 is already up-to-date!
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     /Users/koenvanderpool/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package conll2000 to
[nltk_data]    |     /Users/koenvanderpool/nltk_data...
[nltk_data]    |   Package conll2000 is already up-to-date!
[nltk_data]    | Downloading package conll2002 to
[nltk_data]    |     /Users/koenvanderpool/nltk_data...
[nltk_data]    |   Package conll20

True

In [15]:
# All the synsets of the lemma dog (calling definition() on a synset returns its definition)
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

## Working with python scripts

### Exercise 3

#### a.) Define a function called `my_word_count`, which determines how often each word occurs in a string and returns the result as a python dictionary. Do not use NLTK just yet. Find a way to test it. 

* Write a helper-function called `preprocess`, which removes the punctuation specified by the user, and returns the same string without the unwanted characters. You should call the function `preprocess` inside the `my_word_count` function.

* Remember that there are string methods that you can use to get rid of unwanted characters. Test the `preprocess` function using the following string `'this is a (tricky) test'` by attempting to remove the opening and closing parentheses.

* Remember how we used dictionaries to count words? If not, have a look at Chapter 10 - Dictionaries. 

* Make sure you split the string on a space character ' '. You then can loop over the list to count the words.

* Test your function using an example string, which will tell you whether it fulfills the requirements (remove punctuation, split, count).

#### b.) Create a python script 

Use your editor to create a Python script called **count_words.py**. Place the function definition of the **my_word_count** function in **count_words.py**. Also call the function **my_word_count** in this file to test it. Print the results (you can choose the parameters of this call). Place your helper function definition, i.e., **preprocess**, in a separate script called **utils_3a.py**. Import your helper function **preprocess** into count_words.py. Test whether everything works as expected by calling the script count_words.py from the terminal.

The function **preprocess** preprocesses the text by removing characters that are unwanted by the user. **preprocess** is called within the **my_word_count**. The function **my_word_count** uses the output from the preprocess function and creates a dictionary in which the key is a word and the value is the frequency of the word.

**Please submit these scripts together with the notebooks**.

Don't forget to add docstrings to your functions. 

In [16]:
# Feel free to use this cell to try out your code. 

# a) Testing the function preprocess with the text string in the description

### Taken from Feedback Session Block II - 24 September 2023

# I took the code named clean_text_general from the feedback session and I made 
# very small changes in the naming of the variables. The code in the feedback 
# session looked already quite similar in idea to the code I wrote for Assignment 2, 
# but the one in the feedback session is a bit more efficient and clear, so 
# therefore I use that one.

def preprocess(text: str, punct_to_remove, /):
    """
    Removes punctuation characters from a given text string.
    
    :param text: a string of text you wish to preprocess
    :param punct_to_remove: the punctuation characters you wish to remove
    :return: the text with the specified punctuation characters removed
    """
    
    # assigning the text to a new variable to improve readability and clarity of the code
    preprocessed_text = text
    
    # iteration over all punctuation characters
    for punct in punct_to_remove:
        preprocessed_text = preprocessed_text.replace(punct, '') # replace all punctuation characters in text with empty str

    # return the preprocessed text
    return preprocessed_text

###

# We assign the string to the variable tricky_test
tricky_test = 'this is a (tricky) test'

# We call the function preprocess to remove parentheses and assign the resulting string to preprocessed_tricky_test
preprocessed_tricky_test = preprocess(tricky_test, {'(', ')'})

# We print the result
print(preprocessed_tricky_test)

this is a tricky test


In [17]:
# a) Testing the function my_word_count with an example string

# Defining the function with two positional parameters
def my_word_count(text: str, punct_to_remove, /):
    """
    Given a string of text, this function removes the specified punctuation from the text and returns 
    a dictionary containing the words used in the text and their respective counts. 
    
    :param text: a string of text for which you want to determine the word count
    :param punct_to_remove: the punctuation characters you wish to remove from the text
    :return: a dictionary of the words used in the text and their respective counts
    """
    # initialise dictionary
    word2freq = {}
    
    # call the function preprocess to have the string text cleaned from punctuation
    preprocessed_text = preprocess(text, punct_to_remove) 
    
    # split the preprocessed text into words
    words = preprocessed_text.split(' ')
    
    # loop over the words to count them and add them to the dictionary
    for word in words:
        if word in word2freq:
            word2freq[word] += 1
        else:
            word2freq[word] = 1
            
    # returning the dictionary        
    return word2freq

# We assign a string sentence to the variable a_sentence
a_sentence = '"Look at that dog!" he exclaimed. "I have never seen such a dog. Have you?"'

# We call the function and assign the resulting dictionary to word_count_a_sentence
word_count_a_sentence = my_word_count(a_sentence, {',', '.', '"', '?', '!', ':', ';'})

# We print the result
print(word_count_a_sentence)

{'Look': 1, 'at': 1, 'that': 1, 'dog': 2, 'he': 1, 'exclaimed': 1, 'I': 1, 'have': 1, 'never': 1, 'seen': 1, 'such': 1, 'a': 1, 'Have': 1, 'you': 1}
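One way to "find a way to test" the two functions is a handful of `assert` statements. The sketch below restates both functions compactly so that it runs on its own (it uses `dict.get` instead of the if/else above, which is only a stylistic variation); the test strings are made up for illustration.

```python
def preprocess(text, punct_to_remove):
    """Return text with every character in punct_to_remove deleted."""
    for punct in punct_to_remove:
        text = text.replace(punct, '')
    return text


def my_word_count(text, punct_to_remove):
    """Return a dictionary mapping each word in text to its frequency."""
    word2freq = {}
    for word in preprocess(text, punct_to_remove).split(' '):
        # get() returns 0 for a word that has not been counted yet
        word2freq[word] = word2freq.get(word, 0) + 1
    return word2freq


# Checks on made-up inputs; an assert raises AssertionError if a result is wrong
assert preprocess('this is a (tricky) test', '()') == 'this is a tricky test'
assert my_word_count('a dog, a cat.', ',.') == {'a': 2, 'dog': 1, 'cat': 1}

# Splitting on ' ' yields an empty-string "word" when two spaces are adjacent
assert my_word_count('a  b', '') == {'a': 1, '': 1, 'b': 1}
print('all checks passed')
```

The last check documents a consequence of splitting on a single space rather than a bug: whether the empty string should be counted is a design decision left to the author of the function.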


In [18]:
# b) You can test the function my_word_count with another piece of text by calling the script count_words.py 
# from the terminal
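
A possible layout for count_words.py is sketched below. In the actual script, `preprocess` would be imported with `from utils_3a import preprocess`; it is defined inline here only so that the sketch runs on its own. The `main` function, the `if __name__ == '__main__':` guard, and the example sentence are illustrative choices, not requirements of the assignment.

```python
# count_words.py (sketch)

# In the real script, this definition would be replaced by:
#     from utils_3a import preprocess
def preprocess(text, punct_to_remove):
    """Return text with every character in punct_to_remove deleted."""
    for punct in punct_to_remove:
        text = text.replace(punct, '')
    return text


def my_word_count(text, punct_to_remove):
    """Return a dictionary mapping each word in text to its frequency."""
    word2freq = {}
    for word in preprocess(text, punct_to_remove).split(' '):
        if word in word2freq:
            word2freq[word] += 1
        else:
            word2freq[word] = 1
    return word2freq


def main():
    """Call my_word_count on a made-up sentence and print the result."""
    counts = my_word_count('Dogs chase cats. Cats chase mice.', {'.'})
    print(counts)
    return counts


# This block runs when the file is executed as a script
# (python count_words.py), but not when the file is imported as a module
if __name__ == '__main__':
    main()
```

With this layout, running `python count_words.py` from the folder that contains both scripts prints the dictionary, while importing `my_word_count` elsewhere does not trigger the test call.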

## Dealing with text files

### Exercise 4

**Playing with lyrics**

a.) Write a function called `load_text`, which opens and reads a file and returns the text in the file. It should have the file path as a parameter. Test it by loading this file: ../Data/lyrics/walrus.txt

* Hint: remember it is best practice to use a context manager
* Hint: **FileNotFoundError**: This means that the path you provide does not lead to an existing file on your computer. Please carefully study Chapter 14. Please determine where the notebook or Python module that you are working with is located on your computer. Try to determine where Python is looking if you provide a path such as “../Data/lyrics/walrus.txt”. Try to go from your notebook to the location on your computer where Python is trying to find the file. One tip: if you did not store the Assignments notebooks 3a and 3b in the folder “Assignments”, you would get this error.

b.) Write a function called `replace_walrus`, which takes lyrics as input and replaces every instance of 'walrus' by 'hippo' (make sure to account for upper and lower case - it is fine to transform everything to lower case). The function should write the new version of the song to a file called 'walrus_hippo.txt' and store it in ../Data/lyrics. 

Don't forget to add docstrings to your functions. 

In [20]:
# a) Defining a function with one positional parameter

def load_text(file_path: str, /):
    """
    Opens a file at the specified path, reads its contents, and returns the text as a string.
    
    :param file_path: the pathname of the file you want to open
    :return: the content of the file as a string of text
    """
    # using a context manager to access the content of the file
    with open(file_path, "r") as file:
        text  = file.read() # read the entire file and assign it to the variable 'text'
    
    # return the string of text
    return text

# load the song lyrics from its file and assign the returned string to the variable 'content'
content = load_text("../Data/lyrics/walrus.txt")

# print 'content' to test the function
print(content)

"I Am The Walrus"
("Magical Mystery Tour" Version)

I am he
As you are he
As you are me
And we are all together

See how they run
Like pigs from a gun
See how they fly
I'm crying

Sitting on a cornflake
Waiting for the van to come
Corporation tee shirt
Stupid bloody Tuesday
Man, you been a naughty boy
You let your face grow long

I am the eggman (Ooh)
They are the eggmen, (Ooh)
I am the walrus
Goo goo g' joob

Mister city p'liceman sitting pretty
Little p'licemen in a row
See how they fly
Like Lucy in the sky
See how they run
I'm crying
I'm crying, I'm crying, I'm crying

Yellow matter custard
Dripping from a dead dog's eye
Crabalocker fishwife pornographic priestess
Boy you been a naughty girl
You let your knickers down

I am the eggman (Ooh)
They are the eggmen (Ooh)
I am the walrus
Goo goo g' joob

Sitting in an English
Garden waiting for the sun
If the sun don't come
You get a tan from standing in the English rain

I am the eggman
They are the eggmen
I am the walrus
Goo goo g' joob

In [22]:
# b) Defining a function with one positional parameter and one keyword parameter

def replace_walrus(lyrics: str, /, replacement_word="hippo"):
    """
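    Lowercases the song lyrics, replaces the word "walrus" with the (default) word "hippo",
    and writes the result to the file ../Data/lyrics/walrus_hippo.txt.
    
    :param lyrics: string of song lyrics
    :param replacement_word: the word that replaces "walrus" (default: "hippo")
    :return: None; the modified lyrics are written to a file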
    """
    
    # lowercase all the song lyrics
    modified_lyrics = lyrics.lower()
    
    # replace the word "walrus" with the replacement_word and assign the modified lyrics to a variable
    modified_lyrics = modified_lyrics.replace("walrus", replacement_word)
    
    # using a context manager to write the modified lyrics to a file
    with open("../Data/lyrics/walrus_hippo.txt", "w") as outfile:
        outfile.write(modified_lyrics)

# calling the function with the song lyrics to replace the word "walrus" and store the result in a file
replace_walrus(content)
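
The function above hard-codes its output location, which makes it difficult to check without the course folder structure. A variant that takes the path as a parameter can write to a temporary directory instead. The sketch below assumes a hypothetical name `replace_walrus_to`, a hypothetical `output_path` parameter, and made-up lyrics; it is not the signature the exercise asks for.

```python
import os
import tempfile


def replace_walrus_to(lyrics, output_path, replacement_word='hippo'):
    """Lowercase lyrics, replace 'walrus', and write the result to output_path."""
    modified_lyrics = lyrics.lower().replace('walrus', replacement_word)
    with open(output_path, 'w', encoding='utf-8') as outfile:
        outfile.write(modified_lyrics)
    return modified_lyrics


# Write to a temporary directory, then read the file back to check its content
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'walrus_hippo.txt')
    replace_walrus_to('I am the Walrus\nI am the walrus', path)
    with open(path, encoding='utf-8') as infile:
        written = infile.read()

print(written)
```

Reading the file back confirms both that the file was created and that the upper-case 'Walrus' was replaced after lowercasing.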

## Analyzing text with nltk

### Exercise 5

**Building a simple NLP pipeline**

For this exercise, you will need NLTK. Don't forget to import it. 

Write a function called `tag_text`, which takes raw text as input and returns the tagged text. To do this, make sure you follow the steps below:

* Tokenize the text. 

* Perform part-of-speech tagging on the list of tokens. 

* Return the tagged text


Then test your function using the text snippet below (`test_text`) as input.

Please note that the tags may not be correct and that this is not a mistake on your end, but simply NLP tools not being perfect.

In [23]:
test_text = """Shall I compare thee to a summer's day?
Thou art more lovely and more temperate:
Rough winds do shake the darling buds of May,
And summer's lease hath all too short a date:"""

In [24]:
import nltk

# defining a function with one positional parameter
def tag_text(text: str, /):
    """
    Performs tagging on a given string of text and returns the tagged text.
    
    :param text: a string of text
    :return: the tagged text
    """
    # tokenizing the given text
    tokens = nltk.word_tokenize(text)
    
    # tagging the tokens
    tagged_tokens = nltk.pos_tag(tokens)
    
    # return the tagged text
    return tagged_tokens

# calling the function and assign the tagged text to the variable tag_test_text
tag_test_text = tag_text(test_text)

# print the tagged text
print(tag_test_text)

[('Shall', 'NN'), ('I', 'PRP'), ('compare', 'VBP'), ('thee', 'JJ'), ('to', 'TO'), ('a', 'DT'), ('summer', 'NN'), ("'s", 'POS'), ('day', 'NN'), ('?', '.'), ('Thou', 'NNP'), ('art', 'RB'), ('more', 'RBR'), ('lovely', 'RB'), ('and', 'CC'), ('more', 'JJR'), ('temperate', 'NN'), (':', ':'), ('Rough', 'NNP'), ('winds', 'NNS'), ('do', 'VBP'), ('shake', 'VB'), ('the', 'DT'), ('darling', 'VBG'), ('buds', 'NNS'), ('of', 'IN'), ('May', 'NNP'), (',', ','), ('And', 'CC'), ('summer', 'NN'), ("'s", 'POS'), ('lease', 'NN'), ('hath', 'NN'), ('all', 'DT'), ('too', 'RB'), ('short', 'JJ'), ('a', 'DT'), ('date', 'NN'), (':', ':')]


## Python knowledge

### Exercise 6

6.a) Explain in your own words the difference between the global and the local scope.

Scope refers to where in a program a name can be accessed. To explain the difference between the global and the local scope, we take the creation of a variable as an example. A variable created in the global scope can be accessed throughout the program, even from local environments such as a function. Here's an example:

In [32]:
# a global variable is made
number = 5

def example_function():
    # we access the global variable from a local environment and print it
    print(number)

# calling the function results in the number to be printed
example_function()

5


We can see that a global variable can be accessed in the whole program, even in a local environment. However, anything created in a local environment remains accessible only in that local scope. For instance, a variable created inside a function can only be used inside that function, not in the rest of the program. Here is an example to demonstrate:

In [33]:
def another_example_function():
    another_number = 5
    print(another_number)

print(another_number)

NameError: name 'another_number' is not defined

When we try to print the variable another_number from the global environment, we get an error saying that the variable is not defined. This is because we are trying to access a local variable from the global environment, where this variable does not "exist". The variable only exists in the local scope of the function. Having a distinction between global and local scope is very handy for larger projects. Often you want to write different functions that perform different operations on similar pieces of data. Being able to do these operations in a local environment keeps your code clean and readable.
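
A related case that follows from the same rules: assigning to a name inside a function creates a new local variable, even if a global variable with the same name exists. The short sketch below illustrates this; the names `shadowing_function` and `local_result` are made up for the example.

```python
number = 5  # global variable


def shadowing_function():
    # This assignment creates a new local variable called number;
    # the global variable with the same name is left untouched
    number = 10
    return number


local_result = shadowing_function()

print(local_result)  # 10
print(number)        # 5
```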

6.b) What is the difference between the modes 'w' and 'a' when opening a file?

Opening a file in 'w' mode is similar to opening it in 'a' mode when the file does not yet exist: in both cases the file is created and you can write data to it. However, when the file already exists, the two modes differ. Opening an existing file in 'w' mode empties the file of its current content, so anything you write replaces what was there. Opening it in 'a' mode keeps the current content and lets you add (append) data to it. In 'a' mode, the file pointer is placed at the end of the file, so whatever you write is added after the existing content.
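
The difference can be checked with a small experiment. The sketch below writes to a file in a temporary directory so that no existing files are touched; the file name `demo.txt` and the written lines are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.txt')

    # 'w' creates the file (or empties an existing one) before writing
    with open(path, 'w') as outfile:
        outfile.write('first\n')

    # 'a' keeps the existing content and writes at the end of the file
    with open(path, 'a') as outfile:
        outfile.write('second\n')

    with open(path) as infile:
        after_append = infile.read()

    # Opening in 'w' mode again discards both earlier lines
    with open(path, 'w') as outfile:
        outfile.write('third\n')

    with open(path) as infile:
        after_overwrite = infile.read()

print(repr(after_append))     # 'first\nsecond\n'
print(repr(after_overwrite))  # 'third\n'
```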