In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab4.ipynb")

# Lab 4: Functions, Classes and Testing

## Instructions
Follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/).

In [None]:
#Run this cell
import pandas as pd
import warnings
warnings.filterwarnings('ignore') #pandas can get pretty verbose!

## Code Quality
rubric={quality:5}

The code that you write for this assignment will be given one overall grade for code quality; see our code quality rubric as a guide to what we are looking for. Also, for this course (and other MDS courses that use Python), we are trying to follow the PEP 8 code style. There is a guide you can refer to: https://peps.python.org/pep-0008/

Each code question will also be assessed for code accuracy (i.e., does it do what it is supposed to do?).

### INTRODUCTION

In this lab, you'll be looking at a dataset called MASSIVE which contains ~1M sentences across 50 languages. It is a "parallel corpus", meaning that the same sentences appear in all languages. Two sentences with the same ID in different languages can be considered 'translations' of each other. (Strictly speaking, they are not: each is a translation of the English sentence with that ID, but we can ignore that nuance for today.)

Sentences are referred to as "utterances" in the context of MASSIVE, and they represent things that people would say to virtual assistants, such as "set an alarm", "remind me to water my plants", "what time is my next appointment", etc. Utterances are available as plain text, or with semantic annotation.

You can examine the full dataset on HuggingFace here: https://huggingface.co/datasets/AmazonScience/massive

To keep things manageable for the lab, we are using a slim version of MASSIVE consisting of just 5 languages with 600 utterances each: English, French, German, Korean, and Vietnamese. The data is available in 5 files named "{language}_massive_data.csv". For example, you can load the English data like this:

In [None]:
#Run this cell
english = pd.read_csv('english_massive_data.csv', encoding='utf-8')
english.head(5)

Each language file has the same 7 columns:
- `language` is the name of a language
- `split` Each language is split into 'train', 'test', and 'validation' sets. An utterance will only appear in one of these sets.
- `id` is an utterance id. Within a language, each utterance has a unique id. Across languages, utterances will have the same id if they are 'translations' of each other.
- `utt` Raw text of an utterance, written in the conventional orthography of a language
- `annot_utt` An annotated version of the utterance, where some words may have semantic labels indicated by square brackets
- `scenario` The general topic of the sentence (news, alarms, music, datetime, etc.)

## Exercise 1: get_massive_data()
rubric={autograde:12}

### Description

Your first task is to write a new function that takes a list of language names and returns a single pandas DataFrame with all the relevant data from the CSV files. Someone would be able to call your function like this:

`massive = get_massive_data(languages=['English', 'French'], split='test')`

Don't forget to write a docstring, explaining what your function does!

### Signature

`get_massive_data(languages: list, split_type: str) -> pandas.DataFrame`

### Arguments

`languages` is a list of languages. This argument is required and should not have a default value. 

`split_type` Options are 'test', 'train', 'validation', or 'all'. This argument is optional, and the default value should be 'all'.

### Return value
The function should return a pandas DataFrame. Use the utterance id as your index. Note that this will create non-unique indexes, since the same id values are used across languages.  Be sure to convert the id to an integer! 

The DataFrame should have these columns:

<table>
<tr><td>'language'</td>	    <td>the name of a language (passed as part of the `languages` argument in this function)</td></tr>
<tr><td>'text'</td>	 	    <td>a natural language sentence (corresponds to the 'utt' column in MASSIVE)</td></tr>
<tr><td>'annotation'</td> 	<td>the same sentence with semantic labelling on words (corresponds to the 'annot_utt' column in MASSIVE)</td></tr>
<tr><td>'scenario'</td>	    <td>a semantic label for a scenario (corresponds to 'scenario' column in MASSIVE)</td></tr>
</table>

For example, with English, French and German loaded, index 0 looks like this.
<table>
<tr><th>id</th>	<th>language</th>	<th>text</th>	<th>annotation</th>	<th>scenario</th></tr>
			
<tr><td>0</td>	<td>en-US</td>	<td>wake me up at five am this week</td>	<td>wake me up at [time : five am] [date : this week]</td>	<td>alarm</td></tr>
<tr><td>0</td>	<td>fr-FR</td>	<td>réveille-moi à cinq heures du matin cette semaine</td>	<td>réveille-moi à [time : cinq heures du matin]</td> <td>alarm</td></tr>
<tr><td>0</td>	<td>de-DE</td>	<td>wecke mich in dieser woche um fünf uhr auf</td>	<td>wecke mich in [date : dieser woche] um [time :...</td>	<td>alarm</td></tr>
</table>

### Hint
Write a loop to load each dataset individually, and then use pandas' `concat` function to combine them.
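
As a rough illustration of the concatenation step only (not a solution), the sketch below stacks two small hand-made DataFrames with `pd.concat`. The frames, their values, and the language codes are invented for this example; they stand in for whatever you load inside your loop.

```python
import pandas as pd

# Two tiny, made-up frames standing in for per-language data
frame_a = pd.DataFrame({"id": [0, 1], "language": ["aa-AA", "aa-AA"]})
frame_b = pd.DataFrame({"id": [0, 1], "language": ["bb-BB", "bb-BB"]})

# Collect the frames in a list, then concatenate once at the end
frames = [frame_a, frame_b]
combined = pd.concat(frames)

# Using 'id' as the index gives repeated (non-unique) index values
combined = combined.set_index("id")
print(combined.loc[0])
```

Building a list and calling `pd.concat` once is usually simpler than growing a DataFrame inside the loop.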


In [None]:
...

In [None]:
grader.check("q1_1")

<!-- BEGIN QUESTION -->

## Exercise 1.2: error handling for `get_massive_data()`
rubric={accuracy: 5}

Add code to the `get_massive_data()` function to handle incorrect language inputs, like 'zz-BB' or 'Navajo'. If a language name is not recognized, then the function should "fail gracefully". This means that instead of raising an error and stopping, it should skip over the incorrect name and keep processing the rest of the list. Before returning, your function should print a warning that contains all the languages that didn't work correctly. You can simply print this warning to screen with `print()`, you do not need to raise an actual Python Warning. 
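
To show what "failing gracefully" can look like in general, here is a small sketch on an unrelated toy problem. The function `parse_numbers` is hypothetical and is not part of the lab; it skips inputs it cannot handle, remembers them, and prints one warning at the end.

```python
def parse_numbers(raw_values):
    """Convert strings to ints, skipping any that cannot be converted."""
    parsed = []
    skipped = []
    for value in raw_values:
        try:
            parsed.append(int(value))
        except ValueError:
            # Remember the bad input and carry on with the rest
            skipped.append(value)
    if skipped:
        print(f"Warning: could not convert these values: {skipped}")
    return parsed


result = parse_numbers(["1", "two", "3", "four"])
print(result)
```

The same pattern (a list of failures plus a single message before returning) can be adapted to other kinds of invalid input.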

In [None]:
...

<!-- END QUESTION -->



In [None]:
#HELPER CELL
#If you were not able to complete Exercise 1, then you can use the following function as a replacement for the rest of the lab. 
#Just uncomment the code and run the cell.
#It returns a correctly formatted DataFrame object that you can use. 
#Note that it only returns 3 languages: English, French, and German. This is enough to pass all subsequent exercises.

# import pickle
# def get_massive_data(languages, split='all'):
#     with open('massive_dataframe.pkl', mode='rb') as f:
#         massive = pickle.load(f)
#     return massive

## Exercise 2: get_translations()
rubric={autograde:12}

### Description
Now that you have MASSIVE formatted as a DataFrame, your next task is to write a search function that takes an utterance id and returns all the translations of that utterance. For example, to get the translations for utterance 17 we could do this:

```
languages = ['English', 'Korean']
massive = get_massive_data(languages)
utterance_17 = get_translations(massive, 17)
```

Don't forget to add a docstring to this function, explaining what it does.

### Signature

`get_translations(massive: pd.DataFrame, utterance_id: int, annotations: bool) -> pandas.DataFrame`

### Arguments

`massive` is a DataFrame containing MASSIVE data. Ideally this should be the output of your `get_massive_data` function. However, if you were unable to pass all of the tests, you can run the "helper cell" above, which gets a properly formatted DataFrame for you.

`utterance_id` is an integer representing an utterance id

`annotations` is a boolean. If True, the function returns values from the 'annotation' column; if False, it returns values from the 'text' column. The default is False.

### Return value
This returns a DataFrame where rows are indexed by language, and there is one column called either 'text' or 'annotation' (depending on the argument supplied to the function).
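
The sketch below is not the exercise solution; it only illustrates, on made-up data, how label-based selection behaves when an index contains repeated values, and how selected rows can be re-indexed by another column. The column names `colour` and `size` are invented for the example.

```python
import pandas as pd

# Made-up data: the index label 7 appears twice
toy = pd.DataFrame(
    {"colour": ["red", "blue", "green"], "size": ["S", "M", "L"]},
    index=[7, 7, 9],
)

# Passing a list to .loc returns a DataFrame with every row that has that label
rows = toy.loc[[7]]

# Re-index the selected rows by another column and keep a single column
by_colour = rows.set_index("colour")[["size"]]
print(by_colour)
```

Note that `toy.loc[[12345]]` would raise a `KeyError`, because that label is not in the index.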


In [None]:
...

In [None]:
grader.check("q2_1")

## Exercise 2.1: error handling for `get_translations()`
rubric={accuracy: 5}

Update the `get_translations()` function to handle invalid utterance IDs. If an invalid ID is passed, then the function should return an empty dictionary.

In [None]:
...

In [None]:
grader.check("q2_1")

## Exercise 3: get_slot_translations()
rubric={autograder:12}

### Description
The utterances in MASSIVE have two kinds of semantic labels: a "scenario", which is the general topic of the whole utterance, and "slots", which are specific words or phrases. Scenarios have their own column in the data. Slots have to be extracted from the text in the "annotation" column, where they are indicated by square brackets.

For example, utterance 7 in the 'audio' scenario has this annotation:

`pause for [time : ten seconds]`

This means the slot named 'time' has a value of 'ten seconds'. The slot names are always in English. Utterance 7 in fr-FR is annotated like this:

`pause pendant [time : dix secondes]`

For this exercise, you'll write a function that takes a scenario as input, and returns a table of translations for each slot in each utterance in that scenario. Extracting the slot values requires using regular expressions, which we didn't cover during class, so the code for creating a `slot_name` and `slot_value` column is included in the solution for you. 

Someone would be able to use your function like this:

```
massive = get_massive_data(['English', 'French'])
audio_slots = get_slot_translations(massive, 'audio')
```

Don't forget to add a docstring to your function, explaining what it does.

### Signature
`get_slot_translations(massive: pd.DataFrame, scenario: str) -> pd.DataFrame`


### Arguments
`massive` a pandas dataframe with MASSIVE data. Required, no default value.

`scenario` a string representing one of the scenarios in MASSIVE. Required, no default value.

### Return value
A dataframe indexed by utterance id. The first column is called 'slot_name', and the remaining columns show translations of that slot for each language passed into the function. Some utterances in MASSIVE don't have any slots. Fill any missing values with the string 'no slots'. 

For example, if this function were called with the English, French and German languages in the 'calendar' scenario, the head of the output table would look like this:

<table>
    <tr><th>id</th><th>slot_name</th><th>English</th><th>French</th><th>German</th></tr>
    <tr><td>33</td><td>no slots</td><td>no slots</td><td>no slots</td><td>no slots</td></tr>
    <tr><td>1137</td><td>time</td><td>day</td><td>journée</td><td>tag</td></tr>
    <tr><td>1274</td><td>event_name</td><td>meeting</td><td>réunion</td><td>meeting</td></tr>
    <tr><td>2042</td><td>event_name</td><td>meeting</td><td>réunion</td><td>besprechung</td></tr>
</table>

### Hint
There is already code provided for you to add the `slot_value` and `slot_name` columns to your table. You need to reshape the data into the correct format.
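
To make the provided extraction code less of a black box, the sketch below runs the same pattern on a few strings. The first two annotations come from the examples in this lab; the third (`stop`) is made up to show what happens when an utterance has no slots.

```python
import pandas as pd

# The same pattern used in the starter code
pattern = r'\[([^:\[\]]+) : ([^\[\]]+)\]'

annotations = pd.Series([
    "pause for [time : ten seconds]",
    "wake me up at [time : five am] [date : this week]",
    "stop",  # made-up utterance with no slots
])

# str.extract returns one column per capture group,
# and only looks at the first match in each string
slots = annotations.str.extract(pattern)
slots.columns = ["slot_name", "slot_value"]

# Rows with no match hold NaN, which fillna can replace
slots = slots.fillna("no slots")
print(slots)
```

Because `str.extract` keeps only the first match, the second utterance yields its `time` slot and ignores the `date` slot.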

In [None]:
def get_slot_translations(massive, scenario_name):
    #---Don't delete these lines! This code finds the necessary slots for you and adds them to the dataframe
    pattern = r'\[([^:\[\]]+) : ([^\[\]]+)\]'
    massive[['slot_name', 'slot_value']] = massive['annotation'].str.extract(pattern)
    #---
    ...

In [None]:
grader.check("q3_1")

## Exercise 4 - Massive class
rubric={autograder:10}

One thing that's a little awkward about our functions so far is that we have to keep passing the `massive` variable around:

```
massive = get_massive_data(['Korean', 'German'], split='test')
utt_123 = get_translations(massive, 123)
weather_slots = get_slot_translations(massive, 'weather')
```

For this exercise, you'll define a `Massive` class, which contains all these functions and remembers the languages. The code above would look like this instead:

```
massive = Massive(['Korean', 'German'], split='test')
utt_123 = massive.get_translations(123)
weather_slots = massive.get_slot_translations('weather')
```

Don't worry if you didn't pass all the tests in earlier exercises. Your grade in this section depends on your ability to organize code into a class, and you won't be double-graded on the output of any previous function.

Don't forget to add a docstring to your class, explaining what it does, and listing its methods and attributes.

### Instance attributes

`.languages` A list of strings representing language names

`.split` One of 'test', 'train', 'validation', 'all'

`.data`  A pandas dataframe

### Class attributes

`.all_languages` A list of strings representing all languages for which we have data

### Methods

`get_translations()` Takes an utterance id and 'annotations' boolean as input (as in Exercise 2).

`get_slot_translations()` Takes a scenario name as input (as in Exercise 3).

You are not graded on the output of these methods, so don't worry if you had trouble with the previous exercises. The tests for this section will only try to call these functions to check if they exist and accept the appropriate input arguments. There is no test of the return value. You can actually return any truthy value that you want, and you don't actually have to write any real code inside these methods. Do not return a falsey value or `None`, as that will fail autotesting.

### Magic methods

`len(massive)` should return the number of languages that were loaded

Two instances of this class are equal if they have the same set of languages and the same data split (test, train, validation, all)
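
As a reminder of how magic methods hook into built-in behaviour, here is a minimal sketch using an unrelated toy class. `Playlist`, `songs`, and `mood` are invented names for illustration only; they are not part of the lab.

```python
class Playlist:
    """A toy container used only to illustrate __len__ and __eq__."""

    def __init__(self, songs, mood="any"):
        self.songs = songs
        self.mood = mood

    def __len__(self):
        # Called by len(playlist)
        return len(self.songs)

    def __eq__(self, other):
        # Called by playlist_a == playlist_b
        if not isinstance(other, Playlist):
            return NotImplemented
        return set(self.songs) == set(other.songs) and self.mood == other.mood


a = Playlist(["x", "y"], mood="calm")
b = Playlist(["y", "x"], mood="calm")
print(len(a))
print(a == b)
```

Comparing sets rather than lists means the order of the items does not affect equality.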

In [None]:
...

In [None]:
grader.check("q4_1")

## Exercise 5 - Test Driven Development
rubric={accuracy:10, reasoning:10}

<!-- BEGIN QUESTION -->

### Description
For the last exercise, you'll take a 'test-driven development' (TDD) approach to designing a function. This exercise does not involve the MASSIVE dataset from earlier.

You will design a function called `count_words()` which takes text as input and returns a dictionary containing the counts of every word in the text.

### Testing
You must write 5 unit tests for this function using `assert` statements. For each test, provide a short comment, one or two sentences, explaining the purpose of the test. In the spirit of TDD, you should start by writing these tests before coding up your function.
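
If you have not written `assert`-based tests before, the sketch below shows the general shape on a different, deliberately simple function. `reverse_words` is a hypothetical example and is unrelated to `count_words()`; note how each test has a short comment stating its purpose.

```python
def reverse_words(text):
    """Return the words of `text` in reverse order, separated by single spaces."""
    return " ".join(reversed(text.split()))


# Typical input: several words are returned in reverse order
assert reverse_words("one two three") == "three two one"

# Edge case: an empty string has no words, so the result is empty too
assert reverse_words("") == ""

# Extra whitespace between words should not appear in the output
assert reverse_words("a   b") == "b a", "whitespace should be collapsed"
```

An `assert` that passes produces no output; one that fails raises an `AssertionError`, optionally with the message given after the comma.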

### Signature

`count_words(text: str, punctuation: iterable, ignore_case: bool) -> dict`

### Arguments

- `text`, string. The text to be analyzed. Required, no default value.
- `punctuation`, iterable. This can be a string, or list of strings, containing punctuation symbols to remove from the text. If a falsey value is passed, then all punctuation is retained. Required, no default value.
- `ignore_case`, boolean. If `True`, all text is converted to lowercase. Default is `False`.


### Return value
The function returns a dictionary where keys are words from the text, and values represent how many times the words appeared in the text. Here are some examples of what the output would look like for different argument values (the order of the output dictionary doesn't matter):

`text = 'The large cat chased the small cat.'`

`print(count_words(text, punctuation=[], ignore_case=False))`

`{'The': 1, 'large': 1, 'cat': 1, 'chased': 1, 'the': 1, 'small': 1, 'cat.': 1}`

`print(count_words(text, punctuation='.', ignore_case=False))`

`{'The': 1, 'large': 1, 'cat': 2, 'chased': 1, 'the': 1, 'small': 1}`

`print(count_words(text, punctuation='.', ignore_case=True))`

`{'the': 2, 'large': 1, 'cat': 2, 'chased': 1, 'small': 1}`




In [None]:
...

In [None]:
...

<!-- END QUESTION -->

## Challenge - the `os` module
rubric={accuracy:5}

<!-- BEGIN QUESTION -->

Update the `get_massive_data()` function from Exercise 1 to add a new `path` argument. This should be a string that represents a path to the MASSIVE language data files. Keep in mind that this argument is only a path to a directory. The specific languages to load are still provided by the `languages` argument. You will need to write some code that joins together the path and the language file names.

You cannot solve this problem reliably through regular string joining, because Windows and Mac systems use different path separators. Windows uses a `\` symbol (e.g. `C:\Users\Data\french.csv`) while Mac uses `/` (e.g. `User/Data/french.csv`). To make your code work across different operating systems, you'll need to look into Python's `os` module. It has a special join function for creating file paths, which uses the correct separator for the operating system it is running on.

There are copies of the MASSIVE data sitting in a subdirectory called 'data/massive', which you can use for testing this new argument.
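
As a starting point for exploring the `os` module, the sketch below builds a path from its parts with `os.path.join`. The directory and file names are made up for illustration; how you combine this with the `path` and `languages` arguments is up to you.

```python
import os

# Made-up directory and file names, for illustration only
folder = os.path.join("some", "folder")
file_path = os.path.join(folder, "example.csv")

# os.sep is the separator used by the current operating system
print(os.sep)
print(file_path)
```

The printed path will differ between Windows and Mac or Linux, which is exactly why the parts are joined by the module and not by hand.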

In [None]:
import os
...

<!-- END QUESTION -->

