# Scripting languages: Assignment 2

Deadline: Wednesday 29 November 2023, 11:59pm

You are required to submit this assignment in notebook format (`.ipynb`). If you're using Jupyter Notebook on your own computer, this file will be created automatically. If you're using Google Colab, you can create an `.ipynb` file by selecting File -> Download -> Download .ipynb.

You are encouraged to explain what your code is doing, either with (an appropriate amount of) comments or with the notebook's text blocks.

You can upload your file on Toledo (under 'Assignments' $\rightarrow$ 'Assignment 3'). The deadline for submission is **Wednesday 29 November 2023, 11:59pm**.

## Exercise 1: a `TextAnalysis` class

Create a `TextAnalysis` class that computes a number of text statistics for a given document (a plain `.txt` file). Make sure your class has the following functionality:

* your class can be initialized using a file location; upon initialization, your class will read the document from the file location, and will make sure it is properly preprocessed (segmented into sentences and tokenized);

* your class has a method `average_word_length`, which computes the average length of words (i.e. average number of characters per word) in the document;

* your class has a method `type_token_ratio`, which computes the total number of **unique** words (types) divided by the total number of words (tokens) in the document. As an example, the phrase *the dog and the cat and the man* contains 5 types and 8 tokens (and its type token ratio is thus 5/8);

* your class has a method `hapax_ratio`, which computes the number of words occurring exactly once (i.e. [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon)) divided by the total number of words;

* your class has a method `average_sentence_length`, which computes the average number of words per sentence in the document.
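
The four statistics can be checked by hand on the example phrase above. The following is only a minimal sketch, assuming the document has already been split into sentences and tokens; the function name `text_statistics` and the dictionary keys are hypothetical and not part of the required class interface.

```python
from collections import Counter


def text_statistics(sentences):
    """Compute the four statistics for a list of tokenized sentences (hypothetical helper)."""
    # Flatten the nested sentence lists into one list of tokens.
    tokens = [token for sentence in sentences for token in sentence]
    counts = Counter(tokens)  # maps each word (type) to its frequency
    hapaxes = [word for word, freq in counts.items() if freq == 1]
    return {
        "average_word_length": sum(len(token) for token in tokens) / len(tokens),
        "type_token_ratio": len(counts) / len(tokens),
        "hapax_ratio": len(hapaxes) / len(tokens),
        "average_sentence_length": len(tokens) / len(sentences),
    }


# The example phrase from the exercise, treated as a single sentence.
phrase = "the dog and the cat and the man".split()
stats = text_statistics([phrase])
print(stats["type_token_ratio"])  # 5 types / 8 tokens = 0.625
print(stats["hapax_ratio"])       # dog, cat and man occur once: 3 / 8 = 0.375
```

A real document would first need sentence segmentation and tokenization, which this sketch deliberately leaves out.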



In [None]:
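#nltk must be imported in order to use nltk.word_tokenize() and nltk.sent_tokenize().
#Both rely on the Punkt tokenizer models; if they are missing, run nltk.download('punkt') once
#(newer NLTK releases may ask for 'punkt_tab' instead).
import nltk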


#Establishing class and initialization
class TextAnalysis:
    def __init__(self, filename):
        self.filename = filename
        self.text_read = self.read_text()
        self.tokenized_text = self.tokenize_and_filter()
        self.preprocessed_text = self.preprocess()


    #Defining (3) methods that will be executed when the object is instantiated.
    def read_text(self):
        #Opens and reads the given file.
        with open(self.filename, 'rt', encoding='utf8') as infile:
            return infile.read()

    def tokenize_and_filter(self):
        #Returns one flat list with all the tokens of the given file, in lowercase and without punctuation.
        #A token is kept only if it contains at least one letter or digit, so tokens such as "--" are dropped too.
        #Note that certain words (such as "we'll") will be returned as two separate tokens.
        return [token.lower() for token in nltk.word_tokenize(self.text_read)
                if any(char.isalnum() for char in token)]

    def preprocess(self):
        #Returns a list of sentences, each a list of lowercase tokens without punctuation (same filter as above).
        sentences = nltk.sent_tokenize(self.text_read) #separates sentences, but capitalization/punctuation remains.
        preprocessed_sentences = [] #list that collects the "cleaned" sentences.
        for sent in sentences:
            tokenized_and_filtered_text = [token.lower() for token in nltk.word_tokenize(sent)
                                           if any(char.isalnum() for char in token)]
            preprocessed_sentences.append(tokenized_and_filtered_text)
        return preprocessed_sentences



    #Defining (4) methods to calculate statistics.
    def average_word_length(self):
        total_number_of_characters = sum(len(word) for word in self.tokenized_text)
        total_words = len(self.tokenized_text)
        average_length_of_words = total_number_of_characters / total_words
        result_string = f'The average word length is {average_length_of_words:.2f} characters.'
        #Rounding to two decimals.
        return result_string

    def type_token_ratio(self):
        word_counts = {} #Creating dictionary defined by words (keys) and their frequency (values)
        for word in self.tokenized_text:
            if word not in word_counts:
                word_counts[word] = 1
            else:
                word_counts[word] += 1

        #Note how type_count uses the number of different words (types) and not each word's total frequency.
        type_count = len(word_counts)
        total_words = len(self.tokenized_text)
        type_token_ratio = type_count/total_words
        result_string = f'The Type-to-Token Ratio is {type_count}/{total_words} or {type_token_ratio:.2f}.'
        #Rounding to two decimals.
        return result_string

    def hapax_ratio(self):
        word_counts = {} #Creating dictionary defined by words (keys) and their frequency (values)
        for word in self.tokenized_text:
            if word not in word_counts:
                word_counts[word] = 1
            else:
                word_counts[word] += 1

        #Note how hapax_count uses the number of different words that occur only once
        hapax_count = len([word for word, freq in word_counts.items() if freq == 1])
        total_words = len(self.tokenized_text)
        hapax_ratio = hapax_count/total_words
        result_string = f'The Hapax Ratio is {hapax_count}/{total_words} or {hapax_ratio:.2f}.'
        #Rounding to two decimals.
        return result_string

    def average_sentence_length(self):
        total_words = len(self.tokenized_text)
        total_number_of_sentences = len(self.preprocessed_text) #Now using preprocessed_text defined earlier for sentences.
        average_sentence_length = total_words/total_number_of_sentences
        result_string = f'The average sentence length is {average_sentence_length:.2f} words per sentence.'
        #Rounding to two decimals.
        return result_string


#Executing the Code
text_analysis = TextAnalysis('insert filename.txt here')
print(text_analysis.average_word_length())
print(text_analysis.type_token_ratio())
print(text_analysis.hapax_ratio())
print(text_analysis.average_sentence_length())
