# Foundations of Language Technology 2022/23

# Foundations of Language Technology WS 22/23: Homework 03

Please send your solution as a zip file containing all *.ipynb files that were provided for this homework. Include comments in your program code to make it easier to read.

**Naming template: Group_X_homework_Y.ipynb, Group_X_homework_Y.zip**

Please replace X with your group number and Y with the homework number. Submissions that do not follow these rules will not be considered. 

Please only modify the template in the specified markdown and code cells (e.g. `YOUR CODE / ANSWER / IMPORTS HERE`). 
Some cells are left blank on purpose. Please do not modify these cells, because they are used to autograde your submission.

The deadline for the homework is **Friday, 13/01/2023**. Late submissions will not be accepted.

In [None]:
# YOUR IMPORTS HERE
from typing import List, Dict
import nltk, random, csv
import re
from typing import Tuple
import math
import inspect

If you are using Google Colab, you can mount your Google Drive to access files from there because the file `dataset.csv` which is used in this homework might be too large to upload to Colab.
(Google Colab is not required for this homework).


To mount your Google Drive, run the following cell and follow the instructions.
Save the file `dataset.csv` in your Google Drive and change the path in the following cell to the path where you saved the file.

If you need more information about how to use Google Colab with Google Drive, you can find it [here](https://towardsdatascience.com/different-ways-to-connect-google-drive-to-a-google-colab-notebook-pt-1-de03433d2f7a).

In [None]:
# GOOGLE COLAB ONLY
# If you are using Google Colab, remove the leading # from the two lines below, then execute this cell and follow the instructions to mount your Google Drive.

# from google.colab import drive
# drive.mount('/content/drive')


# to access the file dataset.csv, store it in your Google Drive and change the path to the path where you saved the file

# path = "/content/drive/MyDrive/dataset.csv"

# you can use the file like this:
# f = open(path, "r")

# Part A: Language Identifier (27 points)

In this task, you will implement a language identifier. Given a text, it predicts the language that this text is written in.

## Task A.1: Data for the Language Identifier (9 points)
We use the data from `dataset.csv` to train, test, and evaluate the language identifier. The file contains strings of text and their label (the language of the text sample).
We will only use samples in French, English, and Swedish for our language identifier.
* a) (3 points) Open the CSV file. Only keep the samples that are in French, English, or Swedish and store them in the variable `language_data`. The variable should be a list of tuples, where each tuple contains the text and the language label.

*Hint: You should skip the first line of the CSV file, because it contains the column names (header).*
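
The header-skipping pattern from the hint can be sketched as follows. This is a minimal illustration on an in-memory stand-in for the file; the column names, the column order (text first, then language), and the sample rows are assumptions, so check the actual header of `dataset.csv` before relying on them.

```python
import csv
import io

# Hypothetical stand-in for dataset.csv; the real column names and order may differ.
fake_csv = io.StringIO(
    "Text,language\n"
    "Hello world,English\n"
    "Hallo Welt,German\n"
    "Bonjour le monde,French\n"
)

wanted = ["French", "English", "Swedish"]
reader = csv.reader(fake_csv)
next(reader)  # skip the header row
# keep only rows whose label is one of the wanted languages
samples = [(text, label) for text, label in reader if label in wanted]
print(samples)
```

With a real file, the `csv` documentation recommends opening it with `newline=""`, for example `open(path, encoding="utf-8", newline="")`; the encoding here is also an assumption.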

In [None]:
languages = ['French', 'English', 'Swedish']
language_data = []

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# PUBLIC TEST (2 points)
# assert that the number of samples is correct
assert len(language_data) == 3000, "You did not read the correct number of texts."
# assert list of tuples
assert all([isinstance(x, tuple) for x in language_data]), "You did not read the correct data."

In [None]:
#  HIDDEN TESTS, DO NOT MODIFY THIS CELL (1 point)

* b) (4 points) Now we can split the data into a training, dev, and a test set. Write a function `create_dataset` that takes a list of samples and splits it into a training, dev, and a test set. The function should return three lists, where each list contains the samples for the corresponding set. The function should take the following parameters:
    * `dataset`: the list of samples (list of tuples)
    * `train_ratio`: the ratio of samples that should be in the training set (float)
    * `dev_ratio`: the ratio of samples that should be in the dev set (float)
    * `test_ratio`: the ratio of samples that should be in the test set (float)
    * `random_seed`: the seed for the random number generator
Use the random seed to shuffle the samples before splitting them into the different sets.
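
One possible way to shuffle reproducibly and then slice by ratio is sketched below. The helper name `split_by_ratio` is hypothetical, and the sketch gives the test set whatever remains after the train and dev slices instead of using a separate `test_ratio`. The hidden tests may expect a specific way of seeding and shuffling, so treat this as an illustration of the idea only.

```python
import random

def split_by_ratio(items, train_ratio, dev_ratio, seed):
    """Shuffle a copy of items with a seeded generator, then slice it."""
    shuffled = list(items)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # same seed -> same order
    n_train = round(len(shuffled) * train_ratio)
    n_dev = round(len(shuffled) * dev_ratio)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # the remainder
    return train, dev, test

train, dev, test = split_by_ratio(range(20), 0.7, 0.15, seed=42)
print(len(train), len(dev), len(test))
```

Because the generator is seeded, calling the helper twice with the same seed yields the same three lists.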

In [None]:
def create_dataset(dataset: list, train_ratio: float, dev_ratio: float, test_ratio: float,
                   random_seed: int) -> Tuple[list, list, list]:
    """
    Splits the dataset into a training, dev, and a test set.
    :param dataset: the list of samples (list of tuples)
    :param train_ratio: the ratio of samples that should be in the training set (float)
    :param dev_ratio: the ratio of samples that should be in the dev set (float)
    :param test_ratio: the ratio of samples that should be in the test set (float)
    :param random_seed: the seed for the random number generator
    :return: three lists, where each list contains the samples for the corresponding set
    """
    train_set, dev_set, test_set = [], [], []
    # YOUR CODE HERE
    raise NotImplementedError()
    return train_set, dev_set, test_set

In [None]:
seed = 42
train_set, dev_set, test_set = create_dataset(dataset=language_data, train_ratio=0.7, dev_ratio=0.15, test_ratio=0.15,
                                              random_seed=seed)
print("Train={}, Dev={}, Test={}".format(len(train_set), len(dev_set), len(test_set)))

In [None]:
# PUBLIC TEST (3 points)
# assert train set ratio is close to 0.7
assert math.isclose(len(train_set) / len(language_data), 0.7,
                    abs_tol=1e-8), "The ratio of samples in the training set is incorrect."
# assert dev set ratio is close to 0.15
assert math.isclose(len(dev_set) / len(language_data), 0.15,
                    abs_tol=1e-8), "The ratio of samples in the dev set is incorrect."
# assert test set ratio is close to 0.15
assert math.isclose(len(test_set) / len(language_data), 0.15,
                    abs_tol=1e-8), "The ratio of samples in the test set is incorrect."

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (1 point)

* c) (2 points) Before any NLP task, we should pre-process the input text so that it matches the format of the data used to build the model. In our case, we need to tokenize the input text and lowercase the tokens. Implement a function `normalize_text` that takes a text and returns a list of strings. The function should take the following parameter:
    * `text`: the text to be normalized (string)

    The function should return a list of strings where each string is a token from the text. The tokens should be lowercase. You should use the `nltk.tokenize.word_tokenize` function to tokenize the text.

In [None]:
def normalize_text(text: str) -> List[str]:
    """
    Normalize the text by tokenizing and lowercasing the tokens.
    :param text: the text to be normalized (string)
    :return: a list of strings where each string is a token from the text (list of strings)
    """
    results: List[str] = []
    # YOUR CODE HERE
    raise NotImplementedError()
    return results

In [None]:
# PUBLIC TEST (2 points)
# assert the function returns the correct tokens
assert normalize_text("Hello   World!") == ["hello", "world", "!"], "The function returns the incorrect tokens."

## Task A.2: Language Identifier Using Character Bigram Language Model (12 points)

Now you will implement the language identifier, i.e. a function that takes a given text and outputs the language it thinks the text is written in.
The function should base its decision on the frequency of character bigrams in each language.

*Remember: A character bigram is a sequence of two characters. For example, the text "Hello" contains the following character bigrams: "He", "el", "ll", "lo".
The language identifier might decide that the text is written in English, because the character bigram "ll" occurs more often in English than in French or Swedish.*
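
To make the notion concrete, here is a small plain-Python sketch that extracts and counts character bigrams. It uses `zip` and `collections.Counter` as stand-ins for `nltk.bigrams` and NLTK's frequency distributions; the helper name `char_bigrams` is hypothetical.

```python
from collections import Counter

def char_bigrams(word):
    """Return the lowercased character bigrams of a word as 2-tuples."""
    lowered = word.lower()
    return list(zip(lowered, lowered[1:]))

print(char_bigrams("Hello"))

# count bigrams over several words
counts = Counter()
for word in ["Hello", "yellow"]:
    counts.update(char_bigrams(word))
print(counts.most_common(3))
```

The tuple form matches the structure that Task A.2 a) asks for in the conditional frequency distribution.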

We will build our language model based on the samples we retrieved from the dataset in Task A.1.

* a) (3 points) Implement a function `build_language_models(languages, words)`. The function should take a list of languages and a dictionary of words and build a language model for each language. The function should return a conditional frequency distribution (nltk.ConditionalFreqDist) that contains the frequency of character bigrams for each language. The function should take the following parameters:
    * `languages`: the list of languages (list of strings) for which the language model should be built
    * `words`: a dictionary that maps each language to a list of words (list of strings) in that language, e.g. `{"English": ["Hello", "World"], "French": ["Bonjour", "Monde"]}`

    The returned conditional frequency distribution should have the following structure:
    * The conditions of the conditional frequency distribution are the languages.
    * The values are the **lower cased** character bigrams found in `words[language]` as tuples, e.g. `("h", "e"), ("e", "l"), ("l", "l"), ("l", "o")`.

Hints:
- Use `nltk.bigrams` to get the character bigrams from a word. Make sure to convert the word to lower case before getting the character bigrams.
- Use the function `nltk.ConditionalFreqDist().update(...)` to update the conditional frequency distribution with the character bigrams for a given language.

In [None]:
def build_language_models(languages: List[str], words: Dict) -> nltk.ConditionalFreqDist:
    """
    Return conditional frequency distribution where:
        - the languages are the conditions
        - the values are the lower cased character bigrams

    Parameters
    ------
    languages: list of strings
        list of language names
    words: dict of lists of strings
        dictionary where the keys are the language names and the values are the words in the language
    """
    cfd = None
    # YOUR CODE HERE
    raise NotImplementedError()
    return cfd

In [None]:
# PUBLIC TEST (2 points)
# assert the conditional frequency distribution has the correct structure
cfd = build_language_models(languages=["English", "French", "Swedish"],
                            words={"English": ["Hello", "World"],
                                   "French": ["Bonjour", "Monde"],
                                   "Swedish": ["Hej", "Världen"]})
assert cfd.conditions() == ["English", "French",
                            "Swedish"], "The conditional frequency distribution has the wrong conditions."

# assert the format of the values is correct
assert all([isinstance(cfd[language].most_common(1)[0][0], tuple) for language in
            cfd.conditions()]), "The values of the conditional frequency distribution are incorrect."

# assert the most common character bigram is correct for French
assert cfd["French"].most_common(1)[0][0] == ("o", "n"), "The most common character bigram for French is incorrect."

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (1 point)

* b) (4 points) To create the language models, we should be able to access the words from the dataset by language. Write a function `get_words_by_language` that takes a list of samples and returns a dictionary that maps languages to words. The function should take the following parameters:
    * `dataset`: the list of samples (list of tuples) containing the texts and their language
    * `languages`: the list of languages (list of strings) for which the texts should be returned

    The returned dictionary should have the following structure:
    * The keys are the languages.
    * The values are all the words (list of strings) from the samples for the corresponding language as a single list.
    The dictionary should only contain the languages that are specified in the parameter `languages`.

    The function should use the function `normalize_text` to turn the text of each sample into lowercased word tokens.

In [None]:
def get_words_by_language(languages: list, dataset: list) -> dict:
    """
    Return a dictionary that maps languages to words.

    Parameters
    ------
    languages: list of strings
        list of language names that should be in the returned dictionary
    dataset: list of tuples
        list of samples containing the texts and their language
    """
    words_by_language = dict()
    # YOUR CODE HERE
    raise NotImplementedError()
    return words_by_language

In [None]:
# PUBLIC TEST (2 points)
# assert the dictionary has the correct structure
words_by_language = get_words_by_language(languages=["English", "French", "Swedish"],
                                          dataset=[("Hello World!", "English"),
                                                   ("Bonjour Monde!", "French"),
                                                   ("Hej Världen!", "Swedish")])
assert words_by_language == {"English": ["hello", "world", "!"], "French": ["bonjour", "monde", "!"],
                             "Swedish": ["hej", "världen", "!"]}, "The dictionary has the incorrect structure."
# assert the dictionary contains the correct words
assert words_by_language["English"] == ["hello", "world", "!"], "The dictionary contains the incorrect words."

Now we can build the language models. We use the function `get_words_by_language` to get the texts for the languages English, French, and Swedish from the training set.

In [None]:
language_samples_train = get_words_by_language(["English", "French", "Swedish"], train_set)
print("Number of English words: {}".format(len(language_samples_train["English"])))
print("Number of French words: {}".format(len(language_samples_train["French"])))
print("Number of Swedish words: {}".format(len(language_samples_train["Swedish"])))
print("Number of words in total: {}".format(
    len(language_samples_train["English"]) + len(language_samples_train["French"]) + len(
        language_samples_train["Swedish"])))

In [None]:
# PUBLIC TEST (2 points)
# assert the dictionary has likely correct numbers of words
assert len(language_samples_train["English"]) > 48000, "Number of English words is incorrect!"
assert len(language_samples_train["French"]) > 46000, "Number of French words is incorrect!"
assert len(language_samples_train["Swedish"]) > 34000, "Number of Swedish words is incorrect!"

 * c) (5 points) Use the function `build_language_models` to build the language models for the languages English, French, and Swedish (`languages`) based on the training set `language_samples_train`.

In [None]:
# Build the language model
language_model_cfd = None
# Call the function `build_language_models` with correct arguments from above

# YOUR CODE HERE
raise NotImplementedError()

# print the conditions of the conditional frequency distribution
language_model_cfd.conditions()

In [None]:
# PUBLIC TEST (2 points)
# assert the conditional frequency distribution has the correct structure
assert len(
    language_model_cfd.conditions()) == 3, "The number of conditions in the conditional frequency distribution is incorrect."
# assert the conditional frequency distribution has the correct values
assert language_model_cfd['French'].freq(
    ('é', 'c')) > 0.001, "The frequency of the character bigram 'éc' in French is incorrect."

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (3 points)

## Task A.3: Predicting Languages (6 points)


Implement a function `predict_language(language_model_cfd, text)` that takes a conditional frequency distribution of character bigrams and a text and predicts the language of the text. The function should return the language with the highest probability. The function should take the following parameters:
* `language_model_cfd`: the conditional frequency distribution of character bigrams
* `text`: the text for which the language should be predicted

*Hints:*
   * In this task, we utilize the frequency of character bigrams in the language to calculate the language matching score.
   * The language matching score is the sum of the frequencies of the character bigrams in the text.
   * Calculate the language matching score for each language in the conditional frequency distribution and return the language with the highest score.
   * Formula for one language: $score_{language} = \sum_{bigram \in text} freq_{language}(bigram)$
   * where $freq_{language}(bigram)$ is the frequency of the character bigram in this language
   * Return: the language in `language_model_cfd.conditions()` with the highest score
   * If the language matching score is the same for two or more languages, return the language that appears first in the list of languages in the conditional frequency distribution.
   * Each occurrence of a character bigram contributes to the score. Bigrams that are indicative of a particular language (have a high frequency) and appear often in the text therefore contribute the most to that language's score.
   * The higher the score, the more likely it is that the text is written in that language.
   * You can use the `nltk.ngrams` function to generate character bigrams from the text.
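
The scoring rule in the hints can be sketched on toy data as follows. Plain dictionaries of relative frequencies stand in for the `ConditionalFreqDist`, and the helper names `relative_freqs`, `score`, and `best_language` as well as the tiny word lists are made up for illustration.

```python
from collections import Counter

def relative_freqs(words):
    """Map each character bigram to its relative frequency in `words`."""
    counts = Counter(bg for w in words for bg in zip(w, w[1:]))
    total = sum(counts.values())
    return {bg: c / total for bg, c in counts.items()}

# Toy models built from a handful of words; real models use the training set.
models = {
    "English": relative_freqs(["the", "this", "that", "there"]),
    "French": relative_freqs(["le", "les", "elle", "belle"]),
}

def score(model, words):
    """Sum freq(bigram) over every bigram occurrence in the text."""
    return sum(model.get(bg, 0.0) for w in words for bg in zip(w, w[1:]))

def best_language(models, words):
    best, best_score = "", 0.0
    for language, model in models.items():  # dicts keep insertion order
        s = score(model, words)
        if s > best_score:                  # strict '>' keeps the first on ties
            best, best_score = language, s
    return best

print(best_language(models, ["then", "these"]))
print(best_language(models, ["elles"]))
```

The strict comparison is what implements the tie-breaking rule: a later language only wins if its score is strictly higher than the best score seen so far.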

In [None]:
def predict_language(language_model_cfd: nltk.ConditionalFreqDist, text: str) -> str:
    """
    Predict/Guess the language for the given text

    Parameters
    -------
    language_model_cfd:
        dict-like object (ConditionalFreqDist) that maps languages to character bigrams
    text: string
        a given text to be predicted
    return: string
        Name of the most likely language from the given text (key of the language_model_cfd)
    """
    max_score = 0
    most_likely_language = ""

    # Normalize the text before predicting the language
    words = normalize_text(text)

    # Calculate the language matching score for each language
    for language in language_model_cfd.conditions():
        score = 0
        # Calculate the score for each language and keep track of the language with the highest score
        # YOUR CODE HERE
        raise NotImplementedError()

    # Return the language with the highest score
    return most_likely_language


In [None]:
# PUBLIC TEST (4 points)
# assert the function returns the correct language
text1 = "Du gamla, Du fria, Du fjällhöga nord Du tysta, Du glädjerika sköna! Jag hälsar Dig, vänaste land uppå jord, Din sol, Din himmel, Dina ängder gröna."
assert predict_language(language_model_cfd, text1) == "Swedish", "The prediction for text1 is incorrect."

text2 = "Tous les hommes naissent égaux. Le Créateur nous a donné des droits inviolables, le droit de vivre, le droit d'être libre et le droit de réaliser notre bonheur."
assert predict_language(language_model_cfd, text2) == "French", "The prediction for text2 is incorrect."

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (2 points)

In [None]:
# print the samples for which the language prediction is incorrect
for text, language in test_set:
    predicted_language = predict_language(language_model_cfd, text)
    if predicted_language != language:
        print("Predicted language: {}, Actual language: {}".format(predicted_language, language))
        print(text)
        print()

# Part B: Language Identifier using Naive Bayes (31 points)

## Task B.1: Create a feature extractor function (10 points)

Now we will use the Naive Bayes classifier to predict the language of a text. We will use a different feature extractor function to create features for the Naive Bayes classifier.

The function `extract_features()` should take a text and return a dictionary of features. The features should be
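* `avg_vowels`: the average number of vowels [a, e, i, o, u, y] per character in the text, i.e. the vowel count divided by the total number of characters including spaces (the public test expects 0.4 for "Bonjour petite fille")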
* `avg_word_length`: the average word length in the text
* `num_accented_chars`: the **total** number of accented characters [á, é, í, ó, ú] in the text
* `num_umlaut_chars`: the **total** number of umlaut characters [ä, ö, ü] in the text
* `num_punctuation_chars`: the **total** number of punctuation characters [. , ! ? ; :] in the text
* `num_uppercase_chars`: the **total** number of uppercase characters in the text
* `num_consecutive_vowels`: the **total** number of consecutive vowels in the text (e.g. "aeiou" has 5 consecutive vowels)
* `num_consecutive_consonants`: the **total** number of consecutive consonants in the text (e.g. "bcdfghjklmnpqrstvwxyz" has 21 consecutive consonants)
* **all lowercased character bigrams** in the text (the presence of a character bigram can be indicated by setting the value of the feature to `True` or `1`) (e.g. "hello" has the character bigrams "he", "el", "ll", "lo")


Hint: You can use the `nltk.tokenize.word_tokenize` function to tokenize the text where necessary.
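
As a rough illustration of the feature dictionary, the sketch below computes only `avg_vowels`, `avg_word_length`, and the bigram presence flags. It makes two assumptions: words are obtained with a plain whitespace split instead of `word_tokenize`, and `avg_vowels` is the vowel count divided by the number of characters in the text, which is consistent with the public test value of 0.4 for "Bonjour petite fille". The function name `sketch_features` is hypothetical, and the sketch does not handle empty texts.

```python
VOWELS = "aeiouy"

def sketch_features(text):
    """Compute three of the features; the rest follow the same pattern."""
    words = text.split()  # assumption: whitespace split instead of word_tokenize
    lowered = text.lower()
    features = {
        # vowels divided by all characters (spaces included)
        "avg_vowels": sum(ch in VOWELS for ch in lowered) / len(text),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }
    # presence flag for each lowercased character bigram, keyed as a 2-char string
    for word in words:
        w = word.lower()
        for first, second in zip(w, w[1:]):
            features[first + second] = True
    return features

feats = sketch_features("Bonjour petite fille")
print(feats["avg_vowels"], feats["avg_word_length"], feats["on"])
```

Note that a tokenizer which splits off punctuation as separate tokens would change `avg_word_length` for texts that contain punctuation.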

In [None]:
# For reference
accented_chars = u"áéíóú"
umlaut_chars = u"äöü"
vowels = "aeiouy"
consonants = "bcdfghjklmnpqrstvwxyz"

In [None]:
def extract_features(text: str) -> Dict[str, float]:
    """
    Extract features from the given text
    text: string
        a given text to extract features from
    return: dict
        a dictionary of features
    """
    feature_set = {}
    # YOUR CODE HERE
    raise NotImplementedError()
    return feature_set

In [None]:
# PUBLIC TESTS (5 points)
example = "Bonjour petite fille"
example_features = extract_features(example)
# assert the features contain some of the expected features
assert "num_accented_chars" in example_features, "The feature 'num_accented_chars' is missing."
assert "num_umlaut_chars" in example_features, "The feature 'num_umlaut_chars' is missing."
assert "num_punctuation_chars" in example_features, "The feature 'num_punctuation_chars' is missing."

# assert avg_vowels is close to 0.4
assert math.isclose(example_features["avg_vowels"], 0.4, rel_tol=1e-2), "The value of 'avg_vowels' is incorrect."
# assert avg_word_length is close to 6
assert math.isclose(example_features["avg_word_length"], 6, rel_tol=1e-2), "The value of 'avg_word_length' is incorrect."

In [None]:
# PUBLIC TESTS (2 points)
hello_test = "Hello!"
hello_features = extract_features(hello_test)
# assert the character bigrams are present as features
assert hello_features["he"], "The feature 'he' is missing."
assert hello_features["el"], "The feature 'el' is missing."

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (3 points)
# The tests check `num_uppercase_chars`, `num_consecutive_vowels`, `num_consecutive_consonants`

## Task B.2: Train a Naive Bayes classifier (7 points)
* a) (4 points) Prepare the train and test sets for the Naive Bayes classifier. The feature set should be a list of tuples where the first element is the feature dictionary and the second element is the language label. The train set should be created using the texts and language labels in the `train_set` variable from task A.1. The test set should be created using the texts and language labels in the `test_set` variable from task A.1. Store the train and test feature sets in the `train_features` and `test_features` variables respectively.


    *Hint: You can use the `extract_features` function to create the feature dictionary.*

In [None]:
train_features, test_features = [], []
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# PUBLIC TESTS (4 points)
# assert the train data contains the expected number of texts
assert len(train_features) == 2100, "The train data does not contain the expected number of texts."
# assert the test data contains the expected number of texts
assert len(test_features) == 450, "The test data does not contain the expected number of texts."
# assert train_features is in the expected format
assert isinstance(train_features[0], tuple), "The train data is not in the expected format."
# assert the train data contains some of the expected features
assert "avg_vowels" in train_features[0][0], "The feature 'avg_vowels' is missing."

* b) (3 points) Train the Naive Bayes classifier using the training data and evaluate the classifier using the test data. Print the accuracy of the classifier. Print the 10 most informative features of the classifier.
 Use the `classifier` variable to store the trained classifier.

    Hint: You can use the `nltk.classify.accuracy` function to evaluate the classifier.

In [None]:
classifier = None
accuracy = 0.0
top10_most_informative_features = []
# YOUR CODE HERE
raise NotImplementedError()
print(f"Accuracy: {accuracy}")
print("Top 10 most informative features:")
for feature in top10_most_informative_features:
    print(feature)

In [None]:
# PUBLIC TESTS (1 point)
# assert the classifier is trained
assert classifier, "The classifier is not trained."

In [None]:
# PUBLIC TESTS (1 point)
# assert the accuracy is correct
assert accuracy > 0.98, "The accuracy is incorrect."

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (1 point)

## Task B.3: Evaluate the classifiers (14 points)

We want to evaluate both the Naive Bayes classifier and the cfdist classifier from task A.3. We will use the same test set for both classifiers.
The performance of the classifiers is measured using the following metrics:
* **Accuracy**: the fraction of correctly classified texts (a value between 0 and 1)
* **F1 scores**: the harmonic mean of the precision and recall

Implement a function `evaluate_predictions` that takes as input a list of predictions and a list of gold labels and returns the accuracy, and the F1 scores for each language. The function should return a tuple of the form `(accuracy, f1_scores)`, where `accuracy` is a float and `f1_scores` is a dictionary with language labels as keys and F1 scores as values.
The function should take the following parameters:
* `predictions`: a list of predicted language labels
* `gold_labels`: a list of gold language labels (the correct language labels as in the test set)


The function should return the following values:
* `accuracy`: a float representing the accuracy of the predictions
* `f1_scores`: a dictionary with language labels as keys and F1 scores as values


The function should work for both the Naive Bayes classifier and the cfdist classifier from task A.3.

**Hints:**
- The accuracy is the fraction of texts for which the language prediction is correct.
- The F1 score is computed as the harmonic mean of the precision and recall for each language.
- The formula for the F1 score is: $F1 = 2 \cdot \frac{precision \cdot recall}{precision + recall}$
- The precision for a language is the fraction of texts predicted as that language whose gold label is that language: $precision = \frac{TP}{TP + FP}$.
- The recall for a language is the fraction of texts with that gold label that were predicted as that language: $recall = \frac{TP}{TP + FN}$.
- You should not use any external libraries (e.g. sklearn, nltk) to compute the metrics, but use the formulas above.
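
A small worked example of these metrics on made-up labels may help. The helper name `f1_for_label` is hypothetical, and returning 0.0 when a denominator is zero is an assumption that the assignment does not specify.

```python
def f1_for_label(label, predictions, gold_labels):
    """Precision, recall, and F1 for one label, computed from raw counts."""
    pairs = list(zip(predictions, gold_labels))
    tp = sum(p == label and g == label for p, g in pairs)  # true positives
    predicted = sum(p == label for p, _ in pairs)          # tp + false positives
    actual = sum(g == label for _, g in pairs)             # tp + false negatives
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = ["English", "English", "French", "French", "Swedish"]
pred = ["English", "French", "French", "French", "Swedish"]

# fraction of positions where prediction and gold label agree
toy_accuracy = sum(p == g for p, g in zip(pred, gold)) / len(gold)
print(toy_accuracy)
print(f1_for_label("French", pred, gold))
```

Here one English text is misclassified as French, which lowers the precision for French (2 of 3 French predictions are correct) and the recall for English (1 of 2 English texts is found).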

In [None]:
def evaluate_predictions(predictions: List[str], gold_labels: List[str]) -> Tuple[float, Dict[str, float]]:
    """
    Evaluate the predictions of a classifier.
    :param predictions: the predictions of the classifier (a list of language labels)
    :param gold_labels: the gold labels (a list of language labels), i.e. the correct language labels as in the test set
    :return: a tuple of the form (accuracy, f1_scores), where accuracy is a float and f1_scores is a dictionary with language labels as keys and F1 scores as values
    """
    accuracy = 0.0
    f1_scores = {}
    for language in set(gold_labels):
        f1_scores[language] = 0.0
    # YOUR CODE HERE
    raise NotImplementedError()
    return accuracy, f1_scores

In [None]:
# Comparison of the performance of the Naive Bayes classifier with the performance of the cfd model from task A

# predictions on the test set of the Naive Bayes classifier
# uses the test_features variable from task B.2
nb_predictions = [classifier.classify(features) for features, language in test_features]
accuracy_nb, f1_scores_nb = evaluate_predictions(nb_predictions, [l for f, l in test_features])

# predictions on the test set of the cfdist classifier
# uses the test_set variable from task A
cfdist_predictions = [predict_language(language_model_cfd=language_model_cfd, text=t) for t, l in test_set]
accuracy_cfd, f1_scores_cfd = evaluate_predictions(cfdist_predictions, [l for t, l in test_set])

print(f"Accuracy of the Naive Bayes classifier: {accuracy_nb}")
print(f"Accuracy of the cfd model: {accuracy_cfd}")
print(f"F1 scores of the Naive Bayes classifier: {f1_scores_nb}")
print(f"F1 scores of the cfd model: {f1_scores_cfd}")

In [None]:
# PUBLIC TESTS (2 points)
fake_test_1 = ["English", "English", "English", "English", "English", "English", "English", "English", "English",
               "English"]
test_acc, test_f1 = evaluate_predictions(fake_test_1, fake_test_1)

# assert that the accuracy in evaluate_predictions is correct
assert test_acc == 1.0, "The accuracy in evaluate_predictions is incorrect."
# assert that the F1 score in evaluate_predictions is correct
assert test_f1["English"] == 1.0, "The F1 score in evaluate_predictions is incorrect."

In [None]:
# PUBLIC TESTS (2 points)
# assert the function returns the expected values for the Naive Bayes classifier
nb_accuracy, nb_f1_scores = evaluate_predictions(nb_predictions, [l for f, l in test_features])
assert nb_accuracy > 0.93, "The accuracy is incorrect."
assert nb_f1_scores["English"] > 0.93, "The F1 score for the English language is incorrect."

In [None]:
# PUBLIC TESTS (2 points)
# assert the function returns the expected values for the cfd model
cfdist_accuracy, cfdist_f1_scores = evaluate_predictions(cfdist_predictions, [l for t, l in test_set])
assert cfdist_accuracy > 0.95, "The accuracy is incorrect."
assert cfdist_f1_scores["English"] > 0.95, "The F1 score for the English language is incorrect."

In [None]:
# PUBLIC TESTS (2 points)
# assert that Naive Bayes performs better than CFDist
assert nb_accuracy > cfdist_accuracy, "NB should work better than CFDist"

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (2 points)

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (2 points)
# Evaluate the NB classifier on the dev set

In [None]:
# HIDDEN TESTS, DO NOT MODIFY THIS CELL (2 points)
# evaluate the cfd model on the dev set

# Part C: Chatbot that identifies your language (not graded)
Finally, we showcase the use of these classifiers in a chatbot.
The chatbot can identify the language of the user's input. The chatbot asks the user to enter a text and then prints the language of the text. The chatbot asks the user for texts three times or until the user enters the text `exit`.

If you are interested, you can experiment with different types of classifiers.

In [None]:
def respond_with_predicted_language(language):
    """
    Returns a response to the user based on the predicted language
    Parameters
    ----------
    language: str
        the predicted language
    """
    return f"You're speaking {language}."

In [None]:
print("Hello, please speak to me, I can guess the language you speak. (only in English, French, Swedish)")
# loop that asks the user for a text and prints the predicted language, up to three times
for i in range(3):
    text = input("Please enter a text: ")
    if text == "exit":
        break
    features = extract_features(text)
    language = classifier.classify(features)
    # output the predicted language
    print(respond_with_predicted_language(language))