# Foundations of Language Technology 2022/23

## Homework 1

The deadline for group selection on Moodle is November 7th, 2022.

Please send your solution as a zip file containing all files that were provided for this homework (*.ipynb). Please include all files that are needed to run your submission, and all additional files that we ask for in the tasks. Include comments in your program code to make it easier to read.

**Naming template: Group_X_homework_Y.ipynb, Group_X_homework_Y.zip**

Please replace X with your group number and Y with the homework number. Submissions that do not follow these rules will not be considered. 

Please only modify the template in the specified markdown and code cells (e.g. `YOUR CODE / ANSWER / IMPORTS HERE`). 
Some cells are left blank on purpose. Please do not modify these cells, because they are used to autograde your submission. If these cells are modified, the automatic grading for your submission will fail. Please do not modify the cells containing public and private tests. If you want to do your own tests, please use the code cell containing your code solution (`YOUR CODE HERE`).

The deadline for the homework is **Friday, 18/11/2022**. Late submissions will not be accepted.

In [3]:
# YOUR IMPORTS HERE
import re
import nltk
from nltk.book import *
import csv


--- 
### Task 1: Warm-up: Conditionals (3 points)

Write expressions for finding all words in *text1* (nltk.book) that meet the conditions listed below. Store the result in the variable `result` as a **list of words without duplicates**: `['word1', 'word2', ...]`.
* Starts with "de"
* Consists only of lowercase letters
* Ends with "ed"


In [None]:
# Matches lowercase words that start with "de" and end with "ed".
pattern = re.compile(r"de[a-z]*ed")

# set(text1) removes duplicates; fullmatch requires the whole word to match.
result = sorted(w for w in set(text1) if pattern.fullmatch(w))

# Alternate solution without a regular expression:
# result = sorted(w for w in set(text1) if w.startswith('de') and w.islower() and w.endswith('ed'))

print(result)

In [None]:
# PUBLIC TEST
for w in result:
    assert w.startswith("de"), "Doesn't start with 'de'"
    assert w.islower(), "Is not lowercase"
    assert w.endswith("ed"), "Doesn't end with 'ed'"

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

--- 
### Task 2: Template-based Sentence Creation (6 points)
Implement a function `sentence_creation` that receives three arguments, temperature, location, and time, and returns a string “The weather is `temperature` in `location` at `time`”, where the placeholders denote the respective values. The function should be able to take both strings (e.g. "Chicago") and integers (e.g. 9) as arguments. Call your function with the arguments temperature = "miserable", location = "Chicago", time = 9.



In [None]:
def sentence_creation(temperature, location, time):
    """
    This function creates a sentence from three given values (strings or integers) using a pre-defined template.
    Parameters
        ----------
        temperature: int/str
            indicates the temperature value, e.g. miserable / 29
        location : int/str
            indicates the location such as city, country, etc.
        time: int/str
            indicates the time of day, e.g. noon / 9
         
    """
    return f"The weather is {temperature} in {location} at {time}"

sentence_creation("miserable", "Chicago", 9)

In [None]:
# PUBLIC TEST
assert sentence_creation("miserable", "Chicago", 9) == 'The weather is miserable in Chicago at 9', "result incorrect / wrong types"

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

--- 
### Task 3: Transposed Letter Effect (12 points)
The [transposed letter effect](https://en.wikipedia.org/wiki/Transposed_letter_effect) describes how a word is processed when two letters within the word are switched: the word can often still be understood even though its letters have been permuted.

Define a function `transposed_letter_effect` which receives a sequence of words (as a single string) and returns the sequence of words (as a single string) with transposed letters. The function should:
* leave words of three letters or fewer as they are;
* for all longer words, leave the first and last letter unchanged and shuffle all other letters.

Apply the function to the sentence `sent1` below.

*Hint: Use the `random.shuffle()` method which takes a sequence, like a list, and reorganizes the order of the items. Note that this method works on lists.*
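
As a minimal sketch of the hint above: `random.shuffle()` reorders a list in place and returns `None`, so a string has to be converted to a list first and joined back afterwards. The word `"sentence"` and the seed are arbitrary choices for illustration.

```python
import random

random.seed(0)  # arbitrary seed, only to make repeated runs comparable

word = "sentence"                 # illustrative word
inner = list(word[1:-1])          # strings are immutable, so shuffle a list of the inner letters
returned = random.shuffle(inner)  # shuffles in place; the return value is None

shuffled_word = word[0] + "".join(inner) + word[-1]
print(returned)       # None
print(shuffled_word)  # same first and last letter, inner letters in random order
```

Because the shuffle happens in place, assigning its return value (as `returned` does here) is only useful to show that nothing is returned.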


In [None]:
import random
sent1 = "If you can read this sentence without mistakes you are a genius"


def transposed_letter_effect(text):
    """
    This function creates a new string from a given string, where the letters of longer words are randomly shuffled.
    Parameters
        ----------
        text: str
            The sentence to be transposed
         
    """
    words = text.split(" ")
    new_sent = []
    for word in words:
        if len(word) <= 3:
            new_sent.append(word)
            continue
        word_list = list(word[1:-1])
        random.shuffle(word_list)

        new_sent.append(f"{word[0]}{''.join(word_list)}{word[-1]}")
    return " ".join(new_sent)

transposed_letter_effect(sent1)

In [None]:
# PUBLIC TEST
assert type(transposed_letter_effect(sent1)) == str, "incorrect parameter or return type"

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

--- 
### Task 4: Reading Files (17 points)
Read the file `german_text.txt` which was downloaded with this homework notebook. 
* (a) Write a function `read_german` which opens a file and reads it line by line into a list of strings, i.e. one string in the list corresponds to one line of the file. The function should use an encoding that supports the special characters (umlauts) of German text.
* (b) Write a function `write_german` that writes a [`csv` file](https://www.pythontutorial.net/python-basics/python-write-csv-file/) (comma-separated values file) `german_text_analysis.csv` in which each line refers to one line of `german_text.txt`. Please also use an encoding that supports umlauts for this function. Each line consists of 3 comma-separated values:
    * the line,
    * the length of the line, i.e. the number of space-separated tokens in it,
    * the lexical richness of the line in percent, i.e. the range of the vocabulary.
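
A minimal sketch of the two per-line values, assuming that lexical richness means the share of distinct tokens among all space-separated tokens (the way the NLTK book computes it for whole texts). The helper name `line_statistics` and the example line are hypothetical and not part of the provided files.

```python
def line_statistics(line):
    """Return (token count, lexical richness in percent) for one line."""
    tokens = line.split(" ")  # space-separated tokens
    richness = len(set(tokens)) / len(tokens) * 100
    return len(tokens), richness


# Hypothetical example line: 5 tokens, 4 of them distinct ("die" occurs twice).
count, richness = line_statistics("die Tür und die Bücher")
print(count, round(richness, 1))  # 5 80.0
```

The comparison is case-sensitive, so `"Die"` and `"die"` would count as two distinct tokens under this assumption.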


In [None]:
def read_german(filename):
    """
    Read a file given a file path. Returns the line-by-line contents of the file as a list of strings.
    
    Parameters
    ----
    filename: str
        path to the file location
    """
    lines = []
    # Use a context manager so the file is closed again after reading.
    with open(filename, encoding="utf-8") as text_file:
        for line in text_file:
            lines.append(line)

    return lines

print(read_german("german_text.txt"))

def write_german(text_lines, new_filename):
    """
    Write a list of lines to a file.
    
    Parameters
    ----
    text_lines: list
        list of strings, each string is a text line
    new_filename: str
        path to the file location
    """
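    # newline="" lets the csv module control the line endings itself.
    with open(new_filename, encoding="utf-8", mode="w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        writer.writerow(["Line", "Length", "Lexical richness"])
        for row in text_lines:
            line = row.rstrip("\n")  # drop the trailing line break, if any
            writer.writerow([line, len(line.split(" ")), lexical_richness(line)])


def lexical_richness(text):
    """Return the share of distinct space-separated tokens in text, in percent."""
    tokens = text.split(" ")
    return len(set(tokens)) / len(tokens) * 100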

# Run the function
write_german(read_german("german_text.txt"), "german_text_analysis.csv")

In [None]:
# PUBLIC TEST
out = read_german("german_text.txt")
assert type(out) == list
assert type(out[0]) == str

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

--- 
### Task 5: NLTK, Trigrams & Frequencies (14 points)
* (a) Write a function `remove_punctuation` that takes a list of strings and removes all occurrences of the punctuation marks `. , ; ( ) ! ] [ ? ' / "`. The function should return a list of strings with no punctuation. All strings in the returned list should be converted to lowercase.
* (b) Trigrams are sequences of three neighboring words. Write a function `trigrams` that returns all trigrams in a given list of strings. The trigrams should be returned as a list of 3-tuples.
* (c) Write a function `trigram_frequency` to compute the frequencies of all trigrams. The function should take a list of trigrams and count the number of occurrences of each trigram. It should return a dictionary with the trigrams as keys and their numbers of occurrences as values.
* (d) Apply your function `remove_punctuation` to text4 of the nltk book corpus and store the resulting list in the variable `words`. 
* (e) Apply the `trigrams` function to the list `words` and store the resulting list in the variable `trigram_list`.
* (f) Apply the `trigram_frequency` function to the list `trigram_list` and store the resulting dict in the variable `trigram_frequency_dict`.
* (g) Output the 10 most frequent trigrams of text4 and their counts using the `trigram_frequency_dict` dict. Store the result in the variable `top_10_trigrams` and print it.
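
For comparison, the counting and ranking steps in (c) and (g) can also be sketched with `collections.Counter` from the standard library. This is an alternative way to check results, not the function the task asks for, and the token list is a made-up example rather than `text4`.

```python
from collections import Counter

# Made-up token list, already lowercased and without punctuation.
tokens = ["of", "the", "people", "by", "the", "people",
          "for", "the", "people", "of", "the", "people"]

# zip over three shifted views of the list yields every run of three neighboring words.
trigram_list = list(zip(tokens, tokens[1:], tokens[2:]))

# Counter is a dict subclass mapping each trigram tuple to its number of occurrences.
counts = Counter(trigram_list)

# most_common(n) returns the n most frequent (trigram, count) pairs, highest count first.
print(counts.most_common(1))  # [(('of', 'the', 'people'), 2)]
```

A list of n tokens yields n - 2 trigrams, so the 12 tokens here produce 10 trigrams.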


In [None]:
# a
def remove_punctuation(text):
    """
    Removes all punctuation from a given text. Returns a list of strings where every word is in lowercase.
    
    Parameters
    ----
    text: list
        list of strings to be modified
    """
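    # Translation table that deletes every listed punctuation mark.
    table = str.maketrans("", "", ".,;()!][?'/\"")
    cleaned_strings = []
    for word in text:
        # Remove punctuation anywhere in the string (not only at the end), then lowercase.
        word = word.translate(table).lower()
        if word == "":
            continue
        cleaned_strings.append(word)

    return cleaned_strings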

# b
def trigrams(text):
    """
    Retrieves all trigrams from a given text. Returns a list of 3-tuples.
    
    Parameters
    ----
    text: list
        list of strings
    """
    # Slide a window of size three over the list; the range is empty for fewer than three items.
    return [(text[i], text[i + 1], text[i + 2]) for i in range(len(text) - 2)]

# c
def trigram_frequency(list_of_trigrams):
    """
    Counts the occurrences of each trigram in a list of trigrams. Returns a dictionary.
    
    Parameters
    ----
    list_of_trigrams: list
        list of trigrams
    """
    trigram_dict = {}
    for trigram in list_of_trigrams:
        # Keep the tuple itself as the key so the dictionary is keyed by trigrams.
        trigram_dict[trigram] = trigram_dict.get(trigram, 0) + 1
    return trigram_dict

# d
words = remove_punctuation(text4)
# e
trigram_list = trigrams(words)
# f
trigram_frequency_dict = trigram_frequency(trigram_list)
# g
top_10_trigrams = sorted(trigram_frequency_dict.items(), key=lambda x: x[1], reverse=True)[:10]
print(top_10_trigrams)

In [None]:
# PUBLIC TEST
assert remove_punctuation([".",";",",", "(", ")", "!", "]", "[", "?", "'", "/", '"']) == [], "Punctuation removal incorrect"

assert remove_punctuation(["THIS", "Is", "A", "teST", "!"]) == ["this", "is", "a", "test"], "Conversion incorrect"

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY

In [None]:
# PRIVATE TESTS HAPPEN HERE, DO NOT MODIFY