# Assignment 1

## Brief
This Jupyter notebook contains a small program that reads a text file and searches it for specified words. It counts the number of times each word appears in the file and returns the counts as a string, which can be printed to the terminal. The program is designed to handle certain edge cases in a specific way. The text read from each file is split into a list of words, where the text is separated at any non-alphanumeric character. This means a word such as "they're" is split into its constituent parts, "they" and "re". Search terms are treated in the same way: if a search term contains any non-alphanumeric character, it is split at that point, and the resulting parts are searched for in place of the original term. To continue the prior example, if "they're" is searched for, the program searches for "they" and "re" and returns a count for each. Counting is case-sensitive, so "The" and "the" are counted separately.
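
As a minimal sketch of the splitting rule described above, the snippet below applies the regular expression <code>\w+</code>, the same pattern used later in this notebook, to a few made-up strings. The variable names are illustrative only. Note that in Python <code>\w</code> also matches the underscore, so an underscore does not split a word.

```python
import re

# Split at any character that is not a letter, digit or underscore
apostrophe_parts = re.findall(r"\w+", "they're")
hyphen_parts = re.findall(r"\w+", "dry-run, twice")
print(apostrophe_parts)  # ['they', 're']
print(hyphen_parts)      # ['dry', 'run', 'twice']

# An underscore counts as a word character, so no split happens here
underscore_parts = re.findall(r"\w+", "snake_case")
print(underscore_parts)  # ['snake_case']
```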

This notebook is split into two main sections, Functions and Testing. Within Functions, some helper functions are documented and defined before the main program <code>word_count_summary()</code>. The Testing section runs through a range of examples to test and demonstrate how <code>word_count_summary()</code> works. This covers correct and incorrect usage, as well as edge cases.

## Functions

#### Helper functions

To improve readability, I have defined some helper functions. These reduce repetition within the code and keep the notebook neat. The purpose of each function is explained before it is defined, and I have categorised these functions as follows:

* Type checking functions
* Search terms cleaning functions
* Table formatting functions
* Word counting functions



In [1]:
import re

#### Type checking functions

    1. type_check(data, expected_type)
    
This function has two parameters <code>data</code> and <code>expected_type</code>. It checks whether the input data matches the specified <code>expected_type</code>, raising an error if not. The first parameter, <code>data</code>, is the input whose type we are checking. The second parameter, <code>expected_type</code>, specifies the expected data type (such as int, list, str). 
The function uses <code>isinstance()</code> to check whether <code>data</code> is of the expected type. If it is not, the function raises a TypeError. The error message explains that <code>data</code> is not of the required type, and uses the <code>.__name__</code> attribute to get the name of the type as a string, for instance printing <code>'str'</code> rather than <code>\<class 'str'\></code>. A ValueError is raised if <code>data</code> is an empty string or list. If <code>data</code> passes both checks, the function returns nothing.

    2. list_of_str_check(words)

This function has one parameter, <code>words</code>. It checks that <code>words</code> is a list of non-empty strings. If the argument is not a list, or if any element in the list is not a non-empty string, then a TypeError is raised that explains the requirements to the user. An empty list raises a ValueError, via <code>type_check()</code>. If the argument given passes these checks then nothing is returned.

    3. input_checks(input1, input2)

This function has two parameters, <code>input1</code> and <code>input2</code>. The function first checks <code>input1</code> is a string, by calling <code>type_check()</code>. It then checks that <code>input2</code> is either a string, or list of strings, using <code>isinstance()</code> and <code>list_of_str_check()</code>. If both arguments are of the correct type, nothing is returned. If not, a TypeError is raised, with an explanation that the arguments must be of a specific type. This function is called within <code>word_count_summary()</code>, to ensure the inputs are of the correct type, building robustness into the program. 
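
To illustrate the building blocks these checks rely on, the short sketch below shows <code>isinstance()</code>, the <code>.__name__</code> attribute and the emptiness test in isolation. The variable names are hypothetical and are not part of the notebook's functions.

```python
value = "hello"

# isinstance() returns True when the object matches the given type
is_string = isinstance(value, str)
is_list = isinstance(value, list)
print(is_string, is_list)  # True False

# __name__ gives the short type name; str() of the type gives the longer form
short_name = list.__name__
long_name = str(list)
print(short_name)  # list
print(long_name)   # <class 'list'>

# Empty strings and lists are "falsy", which is how emptiness can be detected
empty_values = [not "", not [], not "text"]
print(empty_values)  # [True, True, False]
```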

In [2]:
def type_check(data, expected_type):

    # Check data matches the expected type
    if not isinstance(data, expected_type):
        raise TypeError(f"'{data}' must be of type {expected_type.__name__}.")

    # If a string or list, check data is not empty
    if expected_type in [str, list] and not data:
        raise ValueError(f"'{data}' cannot be empty.")

In [3]:
def list_of_str_check(words):
    
    type_check(words, list)

    # Check all elements in the list are non-empty strings
    if not all(isinstance(element, str) and element != "" for element in words):
        raise TypeError(f"Each element in '{words}' must be a non-empty string.")

In [4]:
def input_checks(input1, input2):

    type_check(input1, str)

    # Set up type checking for input2
    if isinstance(input2, str):
        pass
    elif isinstance(input2, list):
        list_of_str_check(input2)
    else:
        raise TypeError(f"'{input2}' must be a string, or list of strings.")

#### Search terms cleaning functions

    1. split_by_punctuation(word)

This function has one parameter, <code>word</code>, which must be a string. It splits the string at any non-alphanumeric character, effectively splitting at any punctuation or whitespace. The words found are stored as a list, and duplicates are removed whilst preserving order. This is done by converting the list to a dictionary, which removes duplicates, and then back into a list using <code>list(dict.fromkeys(split))</code>. The function then returns a list of unique words if multiple unique words were found, or a single string if only one unique word was found. This function is used to ensure the terms being searched for in <code>word_count_summary()</code> are handled in a consistent way throughout the program.


    2. punctuation_check(words)

This function takes one parameter, <code>words</code>. If <code>words</code> is a string, <code>split_by_punctuation()</code> is called on <code>words</code> and the function returns the result. If <code>words</code> is a list, <code>split_by_punctuation()</code> is called on each element in the list and the result is saved into a new list, <code>store</code>. The list methods <code>.extend</code> or <code>.append</code> are used depending on whether the result of <code>split_by_punctuation()</code> is a list or string. The new list, <code>store</code>, is returned. If <code>words</code> is neither a list nor a string a TypeError is raised telling the user that the argument given to <code>punctuation_check()</code> must be either a string or list of strings. 
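
The two list techniques described above can be sketched in isolation as follows. The example values are made up for illustration, and the variable <code>nested</code> is hypothetical, included only to show what would happen if <code>.append</code> were used on a list result.

```python
# dict.fromkeys keeps the first occurrence of each item, in order
split = ["said", "said", "it"]
unique = list(dict.fromkeys(split))
print(unique)  # ['said', 'it']

# extend adds each element of a list separately; append adds one item
store = []
store.extend(["they", "re"])
store.append("stop")
print(store)  # ['they', 're', 'stop']

# Appending a list would instead nest it as a single element
nested = []
nested.append(["they", "re"])
print(nested)  # [['they', 're']]
```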


In [5]:
def split_by_punctuation(word):

    type_check(word, str)
    
    # Split 'word' by any non-alphanumeric characters
    split = re.findall("\\w+", word)

    # Remove any duplicate words, if there are any
    # Use dict.fromkeys to preserve order
    drop_duplicates = list(dict.fromkeys(split))

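    # Raise an error if no words were found, e.g. for input such as "!!!"
    if not drop_duplicates:
        raise ValueError(f"'{word}' must contain at least one alphanumeric character.")

    # Return a list if multiple unique words are found, otherwise return the single word
    if len(drop_duplicates) > 1:
        return drop_duplicates
    else:
        return drop_duplicates[0]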

In [6]:
def punctuation_check(words):

    # Perform punctuation check if words is a string
    if isinstance(words, str):  
        return split_by_punctuation(words)

    # Split each element in the list at any non-alphanumeric character
    elif isinstance(words, list):  
        store = []
        for word in words:
            split_result = split_by_punctuation(word)

            # Use extend to ensure each element of the list is added separately
            if isinstance(split_result, list):  
                store.extend(split_result)  
                
            # If the word was not split, use append to add the whole split_result
            else:  
                store.append(split_result)
                
        return store  

    else:
        raise TypeError("'{}' must be a string or a list of strings".format(words))

#### Table formatting
    
    1. calc_col_widths(data)

This function takes one parameter, <code>data</code>. It checks that the argument is a dictionary, then checks that each key is a string and each value is an integer. It first calculates the sum of all the values, storing this as the variable <code>total</code>. It then loops over each key-value pair to find the longest key and the longest value, measured in characters, storing the results as <code>max_key_length</code> and <code>max_value_length</code>. These start from the lengths of the labels "TOTAL" and "COUNT", so the columns are never narrower than the header and footer labels. Both are integers giving the number of characters needed for each column of the formatted table built from <code>data</code>. The function returns, in order, <code>max_key_length</code>, the value column width (the larger of <code>max_value_length</code> and the length of <code>total</code> when written as a string), and <code>total</code>.

    2. create_dashed_row(key_col_width, value_col_width)

This function creates a dashed row for a formatted table. It takes two integer parameters, <code>key_col_width</code> and <code>value_col_width</code>. These parameters specify the widths of the two columns in the table. The function uses an f-string to create a dashed line, where dashes are formed of '-' characters, and column separators are '|' characters. Each dashed segment is two characters longer than the corresponding column width, to cover the single space of padding on either side of each cell's contents. The function returns the formatted dashed row as a string.

    3. create_table_rows(dict_input, key_col_width, value_col_width)

This function takes three parameters. The first, <code>dict_input</code>, is a dictionary used to generate the rows of a table, where the keys correspond to the left-hand column cells and the values correspond to the right-hand column cells. Strings are left aligned and integers are right aligned using f-string formatting. The widths of the columns are specified using the parameters <code>key_col_width</code> and <code>value_col_width</code>. Columns are separated using a '|' character. Each row created is appended to a list, which is then joined using the <code>.join</code> string method. This creates a single string, with a new line character separating each row. The function returns this single string.

    4. create_formatted_table(data)

This function creates a formatted table, using <code>calc_col_widths()</code>, <code>create_dashed_row()</code>, and <code>create_table_rows()</code>. It has one parameter, <code>data</code>, which must be a dictionary. It generates column widths using <code>calc_col_widths()</code>, which also validates that <code>data</code> is a dictionary. A variable, <code>table</code>, is initialised as an empty list, then two dictionaries, <code>header_row</code> and <code>total_row</code>, are generated. The function then builds the formatted table from the outputs of <code>create_dashed_row()</code> and <code>create_table_rows()</code>. It joins the rows into a single string, again using the <code>.join</code> string method, with a new line character separating each row. This string is returned.
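
The alignment rules described above can be sketched with a few f-strings. This is a standalone illustration rather than the notebook's own functions: the widths are fixed by hand here, whereas the notebook computes them from the data.

```python
key_col_width = 6
value_col_width = 5

# '<' left-aligns and '>' right-aligns within the given width
header = f"| {'WORD':<{key_col_width}} | {'COUNT':<{value_col_width}} |"
row = f"| {'the':<{key_col_width}} | {4060:>{value_col_width}} |"

# Each dashed segment is the column width plus two, covering the padding spaces
dashes = f"|{'-' * (key_col_width + 2)}|{'-' * (value_col_width + 2)}|"

print("\n".join([dashes, header, dashes, row, dashes]))
```

Because every row is built from the same two widths, the '|' separators line up regardless of the length of each word or count.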


In [7]:
def calc_col_widths(data):

    type_check(data, dict)
    for key, value in data.items():
        type_check(key, str)
        type_check(value, int)
 
    total = sum(data.values())

    # Set initial column widths based on header and footer values
    max_key_length = len("TOTAL")
    max_value_length = len("COUNT")

    # Update column widths using the longest key and value in data
    for key, value in data.items():
        max_key_length = max(max_key_length, len(str(key)))
        max_value_length = max(max_value_length, len(str(value)))

    return max_key_length, max(max_value_length, len(str(total))), total

In [8]:
def create_dashed_row(key_col_width, value_col_width):

    type_check(key_col_width, int)
    type_check(value_col_width, int)

    # Create a dashed row string, adding two to each column width to cover the padding space on either side of each cell
    dash_row = f"|{'-' * (key_col_width + 2)}|{'-' * (value_col_width + 2)}|"
    return dash_row

In [9]:
def create_table_rows(dict_input, key_col_width, value_col_width):

    type_check(dict_input, dict)
    type_check(key_col_width, int)
    type_check(value_col_width, int)
    
    rows = []

    # Iterate over the dictionary items, to create formatted rows
    for key, value in dict_input.items():
        
        # If both the keys and values of the dictionary are strings, left align both
        if (isinstance(key, str) and isinstance(value, str)):
                rows.append(f"| {key:<{key_col_width}} | {value:<{value_col_width}} |")

        # If the key is a string and corresponding value is an integer, left align strings and right align integers    
        elif (isinstance(key, str) and (isinstance(value, int))):
                rows.append(f"| {key:<{key_col_width}} | {value:>{value_col_width}} |")
            
        else:
            raise TypeError("Dictionary contains incorrect data types. Keys must be of type string. Values must be of either type string or integer.")

    # Join the list into a single string, separating each list element by a new line
    return "\n".join(rows)

In [10]:
def create_formatted_table(data):

    # Calculate column widths and total count
    key_col_width, value_col_width, total = calc_col_widths(data)
   
    table = []

    # Create dictionaries for the header and total row
    header_row = {"WORD": "COUNT"}
    total_row = {"TOTAL": total}

    # Format the header row, with dashed rows above and below
    table.append(create_dashed_row(key_col_width, value_col_width))
    table.append(create_table_rows(header_row, key_col_width, value_col_width))
    table.append(create_dashed_row(key_col_width, value_col_width))

    # Format the rows of the table
    table.append(create_table_rows(data, key_col_width, value_col_width))

    # Format the total row, with a dashed row above and below
    table.append(create_dashed_row(key_col_width, value_col_width))
    table.append(create_table_rows(total_row, key_col_width, value_col_width))
    table.append(create_dashed_row(key_col_width, value_col_width))

    # Join each element in 'table' together as a string, using a new line to separate each item
    return "\n".join(table)

#### Word count function

    1. count_words(search_terms, words)
    
This function takes two parameters, <code>search_terms</code> and <code>words</code>. The first argument must be a string or a list of strings, and the second must be a list of strings. The function computes the number of times each word within <code>search_terms</code> appears in the list <code>words</code>, matching whole words exactly and case-sensitively. The behaviour differs slightly depending on the type of <code>search_terms</code>. If <code>search_terms</code> is a string, the function counts how many times the string appears in <code>words</code> and returns the count within a sentence using an f-string. If <code>search_terms</code> is a list of strings, the function generates a dictionary where each element in <code>search_terms</code> becomes a key with its value set to 0. The function then counts how many times each key appears in <code>words</code> and uses <code>create_formatted_table()</code> to return a table with the count of each word in <code>search_terms</code>. If <code>search_terms</code> is neither a string nor a list, a TypeError is raised.
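
The dictionary-based counting described above can be sketched on a small, made-up word list. The final lines use the built-in <code>list.count</code> method as a shorthand for the single-string case; the notebook itself uses an explicit loop, so this is an alternative rather than the notebook's method.

```python
words = ["it", "was", "the", "best", "of", "times", "it", "was", "the", "worst"]
search_terms = ["the", "times", "hidden"]

# Start every search term at zero so absent words still appear in the result
aggregates = {term: 0 for term in search_terms}

# Increment the count whenever a word from the text is one of the search terms
for word in words:
    if word in aggregates:
        aggregates[word] += 1

print(aggregates)  # {'the': 2, 'times': 1, 'hidden': 0}

# For a single string, list.count gives the same exact-match count
single_count = words.count("it")
print(single_count)  # 2
```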

In [11]:
def count_words(search_terms, words):

    list_of_str_check(words)

    # If search_terms is a string, count the number of times it appears in 'words'
    if isinstance(search_terms, str):
        count = 0
        for word in words:
            if search_terms == word:
                count += 1        
                
        # Return the word count in a sentence
        return f"The word '{search_terms}' appears {count} times."

    # If search_terms is a list of strings, count the number of time each element in the list appears in 'words'
    elif isinstance(search_terms, list):
        
        # Initialise a count of 0 for each term, in a dictionary made from 'search_terms'
        aggregates = {term: 0 for term in search_terms}
        for word in words:
            if word in aggregates:
                aggregates[word] += 1

        # Return the formatted table
        return create_formatted_table(aggregates)

    else:
        raise TypeError("Search terms must be either of type string or list.")

### Main Function

    1. word_count_summary(file_path, search_terms)

This function reads a text file and counts the occurrences of each specified search term. It takes two arguments. The first, <code>file_path</code> is a string of the file path that we want to read into the program. The second <code>search_terms</code> is either a string or a list of strings. These are the words that we want to count within the text file. 

The function uses <code>input_checks()</code> to check that <code>file_path</code> is a string and <code>search_terms</code> is either a string or a list of strings. It then calls <code>punctuation_check()</code> to clean <code>search_terms</code>, splitting each term at any non-alphanumeric character, such as punctuation.

This function then opens and reads the file specified by <code>file_path</code>, storing this as a string. It then finds all words within this string using <code>re.findall("\\w+", text)</code>, which extracts alphanumeric words and saves the result as a list, <code>words</code>.

The function then counts the number of times each of the terms in <code>search_terms</code> appears in the list <code>words</code> using <code>count_words()</code>, which handles the case where <code>search_terms</code> is a string differently from the case where it is a list of strings.

If <code>search_terms</code> is a single string this function returns the word count for the specified search_term within a sentence. If <code>search_terms</code> is a list of strings, then the function returns a string which represents a formatted table containing the counts of each search term. 

Calling <code>print(word_count_summary())</code> allows these returned strings to be seen by the user.
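
The read, split and count steps described above can be sketched end to end as follows. This is a simplified illustration under stated assumptions, not the notebook's function: it writes a small made-up text to a temporary file (the name <code>sample.txt</code> is hypothetical) so that it does not depend on the novels, and it uses <code>list.count</code> in place of <code>count_words()</code>.

```python
import os
import re
import tempfile

# Write a small, made-up text file so the example does not depend on the novels
sample_text = "It was the best of times, it was the worst of times."
with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, "sample.txt")
    with open(sample_path, "w", encoding="utf-8") as file:
        file.write(sample_text)

    # Read the file back in as a single string
    with open(sample_path, "r", encoding="utf-8") as file:
        text = file.read()

# Extract the words, then count exact, case-sensitive matches
words = re.findall(r"\w+", text)
times_count = words.count("times")
it_count = words.count("it")

print(words[:4])    # ['It', 'was', 'the', 'best']
print(times_count)  # 2
print(it_count)     # 1 ("It" with a capital is a different word)
```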

In [12]:
def word_count_summary(file_path, search_terms):

    # Check data types for both arguments
    input_checks(file_path, search_terms)

    # Remove any non-alphanumeric characters from search_terms, if needed
    search_terms = punctuation_check(search_terms)
    
    # Open the file and read its contents as a string
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    
    # Find all words, and store as a list
    words = re.findall("\\w+", text) 

    # Count occurrences of search_terms within words
    result = count_words(search_terms, words)

    return result

## Testing

This section tests how the function performs in a range of scenarios.

For correct usage, it is assumed that the <code>file_path</code> given is a string and points to an existing text file. For this assignment, it is assumed the text files being read are saved one directory up from the current working directory, hence the relative file paths used in testing begin with <code>../</code>. This path can be changed if the text files are stored in a different directory. It is also assumed that <code>search_terms</code> is either a string or a list of strings. The tests outlined below are replicated for both acceptable types of <code>search_terms</code>.

This function produces correct word counts when each argument is of a valid type. The performance of <code>word_count_summary()</code> is tested on various string inputs, to assess its design. These include:
* All lower case characters
* All upper case characters
* A combination of upper and lower case characters
* Case sensitivity (e.g. "The", "the")
* A string containing numeric characters
* A string containing punctuation (e.g. an apostrophe)
* A string containing a space
* Single character words
* Repeated words
* Multiple spaces between words
* Special characters within words (e.g. hyphenated words)
* A string with a new line character (e.g. "good\nmorning")

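This function raises an error in the following scenarios: a TypeError for arguments of the wrong type, a ValueError for empty values, and a FileNotFoundError for a file that does not exist. These are not explicitly tested, as an uncaught error would stop the notebook from running after clicking the fast forward button. Examples of incorrect usage include: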

 * A file path that is not a string
 * A file path that does not exist
 * A search_terms value that is neither a string, nor a list of strings
 * A search_terms value that contains non-string element(s)
 * A search_terms value that is an empty list, or empty string
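
One possible way to demonstrate such errors without stopping the notebook is to catch them and report the message. The sketch below is an assumption about how this could be done and is not part of the original program: <code>describe_error()</code> and <code>read_text()</code> are hypothetical helpers, with <code>read_text()</code> standing in for the file-reading step.

```python
import os
import tempfile


def describe_error(func, *args):
    """Call func(*args) and return a short description of any expected error."""
    try:
        func(*args)
    except (TypeError, ValueError, FileNotFoundError) as error:
        return f"{type(error).__name__}: {error}"
    return "No error raised"


def read_text(file_path):
    """Stand-in for the file-reading step; not the notebook's main function."""
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()


# A path inside a fresh temporary folder is guaranteed not to exist
with tempfile.TemporaryDirectory() as folder:
    missing_path = os.path.join(folder, "no-such-file.txt")
    missing_result = describe_error(read_text, missing_path)

# len() of an integer raises a TypeError, used here as a simple example
wrong_type_result = describe_error(len, 5)

print(missing_result.split(":")[0])     # FileNotFoundError
print(wrong_type_result.split(":")[0])  # TypeError
```

In the notebook itself, a call such as <code>describe_error(word_count_summary, 123, "the")</code> would be expected to report the TypeError raised by the input checks while allowing the remaining cells to run.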

In [13]:
# All lower case characters
print(word_count_summary("../a-tale-of-two-cities.txt", "times"))
print(word_count_summary("../pride-and-prejudice.txt", ["lost", "the", "hidden"]))

The word 'times' appears 51 times.
|--------|-------|
| WORD   | COUNT |
|--------|-------|
| lost   |    29 |
| the    |  4060 |
| hidden |     0 |
|--------|-------|
| TOTAL  |  4089 |
|--------|-------|


In [14]:
# All upper case characters
print(word_count_summary("../pride-and-prejudice.txt", "OK"))
print(word_count_summary("../a-tale-of-two-cities.txt", ["COLD", "HOT"]))

The word 'OK' appears 0 times.
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| COLD  |     0 |
| HOT   |     0 |
|-------|-------|
| TOTAL |     0 |
|-------|-------|


In [15]:
# A combination of upper and lower case characters
print(word_count_summary("../pride-and-prejudice.txt", "Elizabeth"))
print(word_count_summary("../pride-and-prejudice.txt", ["Jane", "Elizabeth", "Mary", "Kitty", "Lydia"]))

The word 'Elizabeth' appears 634 times.
|-----------|-------|
| WORD      | COUNT |
|-----------|-------|
| Jane      |   292 |
| Elizabeth |   634 |
| Mary      |    39 |
| Kitty     |    71 |
| Lydia     |   170 |
|-----------|-------|
| TOTAL     |  1206 |
|-----------|-------|


In [16]:
# Case sensitivity
print(word_count_summary("../pride-and-prejudice.txt", "It"))
print(word_count_summary("../pride-and-prejudice.txt", ["The", "the", "it"]))

The word 'It' appears 247 times.
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| The   |   273 |
| the   |  4060 |
| it    |  1288 |
|-------|-------|
| TOTAL |  5621 |
|-------|-------|


In [17]:
# A string containing numeric characters
print(word_count_summary("../a-tale-of-two-cities.txt", "0"))
print(word_count_summary("../pride-and-prejudice.txt", ["4", "7", "2"]))

The word '0' appears 0 times.
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| 4     |     1 |
| 7     |     1 |
| 2     |     2 |
|-------|-------|
| TOTAL |     4 |
|-------|-------|


In [18]:
# A string containing punctuation
print(word_count_summary("../pride-and-prejudice.txt", "they're"))
print(word_count_summary("../a-tale-of-two-cities.txt", ["they're", "stop!", "that?"]))

|-------|-------|
| WORD  | COUNT |
|-------|-------|
| they  |   474 |
| re    |     4 |
|-------|-------|
| TOTAL |   478 |
|-------|-------|
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| they  |   457 |
| re    |    11 |
| stop  |    16 |
| that  |  1827 |
|-------|-------|
| TOTAL |  2311 |
|-------|-------|


In [19]:
# A string containing a space character
print(word_count_summary("../a-tale-of-two-cities.txt", "he said"))
print(word_count_summary("../pride-and-prejudice.txt", ["this can", "you want"]))

|-------|-------|
| WORD  | COUNT |
|-------|-------|
| he    |  1455 |
| said  |   661 |
|-------|-------|
| TOTAL |  2116 |
|-------|-------|
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| this  |   375 |
| can   |   202 |
| you   |  1129 |
| want  |    44 |
|-------|-------|
| TOTAL |  1750 |
|-------|-------|


In [20]:
# Single character strings
print(word_count_summary("../pride-and-prejudice.txt", "a"))
print(word_count_summary("../a-tale-of-two-cities.txt", ["I", "A"]))

The word 'a' appears 1899 times.
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| I     |  1968 |
| A     |   140 |
|-------|-------|
| TOTAL |  2108 |
|-------|-------|


In [21]:
# Repeated words
print(word_count_summary("../pride-and-prejudice.txt", "done done done"))
print(word_count_summary("../pride-and-prejudice.txt", ["said said it", "said", "him", "done"]))

The word 'done' appears 92 times.
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| said  |   402 |
| it    |  1288 |
| him   |   752 |
| done  |    92 |
|-------|-------|
| TOTAL |  2534 |
|-------|-------|


In [22]:
# Multiple spaces between words
print(word_count_summary("../pride-and-prejudice.txt", "how   did"))
print(word_count_summary("../a-tale-of-two-cities.txt", ["you   lost   me", "this  can"]))

|-------|-------|
| WORD  | COUNT |
|-------|-------|
| how   |   166 |
| did   |   252 |
|-------|-------|
| TOTAL |   418 |
|-------|-------|
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| you   |  1177 |
| lost  |    41 |
| me    |   526 |
| this  |   498 |
| can   |   126 |
|-------|-------|
| TOTAL |  2368 |
|-------|-------|


In [23]:
# Special characters between words
print(word_count_summary("../a-tale-of-two-cities.txt", "dry-run"))
print(word_count_summary("../pride-and-prejudice.txt", ["magic#gold", "you-want"]))

|-------|-------|
| WORD  | COUNT |
|-------|-------|
| dry   |     9 |
| run   |    14 |
|-------|-------|
| TOTAL |    23 |
|-------|-------|
|-------|-------|
| WORD  | COUNT |
|-------|-------|
| magic |     0 |
| gold  |     0 |
| you   |  1129 |
| want  |    44 |
|-------|-------|
| TOTAL |  1173 |
|-------|-------|


In [24]:
# A new line character
print(word_count_summary("../pride-and-prejudice.txt", "Good\nmorning"))

|---------|-------|
| WORD    | COUNT |
|---------|-------|
| Good    |    14 |
| morning |    79 |
|---------|-------|
| TOTAL   |    93 |
|---------|-------|
