# Data Troubleshooting Example

## This sample notebook shows you the testing paradigm that you will face on all of the exams.

### The notebook contains Exercise 0 from Notebook 1, Part 2.

#### What we are going to do is write code that purposely fails the exercise in multiple ways, to show you how to troubleshoot data errors on the exams.
***

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import pickle
# with open('resource/asnlib/publicdata/test_cases.pkl', 'rb') as fin:
#     cases = pickle.load(fin)
with open('test_cases.pkl', 'rb') as fin:
    cases = pickle.load(fin)

## First, let's solve the problem correctly.

Consider the following dataset of exam grades, organized as a 2-D table and stored in Python as a "list of lists" under the variable name, `grades`.

In [None]:
grades = [
    # First line is descriptive header. Subsequent lines hold data
    ['Student', 'Exam 1', 'Exam 2', 'Exam 3'],
    ['Thorny', '100', '90', '80'],
    ['Mac', '88', '99', '111'],
    ['Farva', '45', '56', '67'],
    ['Rabbit', '59', '61', '67'],
    ['Ursula', '73', '79', '83'],
    ['Foster', '89', '97', '101']
]

grades

**Exercise 0** (`students_test`: 1 point). Complete the function `get_students`, which takes a nested list `grades` as a parameter and returns a new list, `students`, which holds the names of the students as they appear from "top to bottom" in the table.
- **Note**: the parameter `grades` will be similar to the table above in structure, but the data will be different.

In [None]:
def get_students(grades):
    ###
    ### YOUR CODE HERE
    ###
    
    # Sample solution code below
    students = []

    for i in grades:
        if i[0] != 'Student':   #correct code
            students.append(i[0])
    return students
    

The demo cell below should display `['Thorny', 'Mac', 'Farva', 'Rabbit', 'Ursula', 'Foster']`.

In [None]:
students = get_students(grades)
students

The test cell below will check your solution against several randomly generated test cases. If your solution does not pass the test (or if you're just curious), you can look at the variables used in the latest test run. They are automatically imported for you as part of the test.

- `input_vars` - Dictionary containing all of the inputs to your function. Keys are the parameter names.
- `original_input_vars` - Dictionary containing a copy of all the inputs to your function. This is useful for debugging failures related to your solution modifying the input. Keys are the parameter names.
- `returned_output_vars` - Dictionary containing the outputs your function generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.
- `true_output_vars` - Dictionary containing the outputs your function **should have** generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.

All of the test cells in this notebook will use the same format, and you can expect a similar format on your exams as well.
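
As a rough picture of what these four dictionaries hold, here is a minimal sketch with invented values. The `demo_` names and the sample data are assumptions made for illustration only; the real variables are populated by the tester, whose internals are not shown in this notebook.

```python
import copy

# Invented sample input, shaped like the `grades` table above
demo_grades = [['Student', 'Exam 1'], ['Ann', '90'], ['Bob', '85']]

# Hypothetical stand-ins for the four test variables
demo_input_vars = {'grades': demo_grades}
demo_original_input_vars = {'grades': copy.deepcopy(demo_grades)}
demo_returned_output_vars = {'students': ['Ann', 'Bob']}
demo_true_output_vars = {'students': ['Ann', 'Bob']}

# A passing solution leaves the input unchanged and matches the expected output
print(demo_input_vars == demo_original_input_vars)          # True
print(demo_returned_output_vars == demo_true_output_vars)   # True
```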

In [None]:
# `students_test`: Test cell
import nb_1_2_tester
tester = nb_1_2_tester.Tester_1_2_0()
for _ in range(20):
    try:
        tester.run_test(get_students)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
print('Passed. Please submit!')

## Now we will fail the exercise in multiple ways.

## Recall the two tests that are performed. We will demonstrate failure of each test, and how to troubleshoot each.

### First we will fail the test because we have modified the input in some way.

We are going to deliberately write some code that obviously changes the input.

In [None]:
def get_students_modify_input(grades):
    ###
    ### YOUR CODE HERE
    ###
#     print(grades)

# this code modifies the input
# uncomment to walk through
#     grades.append(['This should not be here','50','60','70'])

    # Sample solution code below
    students = []

    for i in grades:
        if i[0] != 'Student':   #correct code
            students.append(i[0])
    return students

In [None]:
students = get_students_modify_input(grades)
students

The test cell below will check your solution against several randomly generated test cases. If your solution does not pass the test (or if you're just curious), you can look at the variables used in the latest test run. They are automatically imported for you as part of the test.

- `input_vars` - Dictionary containing all of the inputs to your function. Keys are the parameter names.
- `original_input_vars` - Dictionary containing a copy of all the inputs to your function. This is useful for debugging failures related to your solution modifying the input. Keys are the parameter names.
- `returned_output_vars` - Dictionary containing the outputs your function generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.
- `true_output_vars` - Dictionary containing the outputs your function **should have** generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.

In [None]:
# `students_test`: Test cell
import nb_1_2_tester
tester = nb_1_2_tester.Tester_1_2_0()
for _ in range(20):
    try:
        tester.run_test(get_students_modify_input)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
print('Passed. Please submit!')

#### So we have failed the exercise, and we can see from the AssertionError that we have modified the input variables in some way. 

What do we do now?

#### As we said before, go back up to your code and find the line that has modified the input variables. There is no need to do any visual inspection at this point.
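
To make the idea of "modifying the input" concrete, here is a small sketch that contrasts an in-place change with a version that builds a new list. The function names `names_mutating` and `names_safe` and the sample table are hypothetical, and the snapshot comparison is only an assumption about how such a check could work, not the tester's actual implementation.

```python
import copy

def names_mutating(grades):
    # append() changes the caller's list in place
    grades.append(['Extra', '0', '0', '0'])
    return [row[0] for row in grades if row[0] != 'Student']

def names_safe(grades):
    # list concatenation builds a new list; the input is left untouched
    rows = grades + [['Extra', '0', '0', '0']]
    return [row[0] for row in rows if row[0] != 'Student']

sample = [['Student', 'Exam 1'], ['Ann', '90']]
snapshot = copy.deepcopy(sample)   # copy taken before calling the functions

names_safe(sample)
safe_unchanged = (sample == snapshot)
names_mutating(sample)
mutating_unchanged = (sample == snapshot)

print(safe_unchanged, mutating_unchanged)   # True False
```

Methods such as `append`, `pop`, `sort`, and item assignment on the parameter are the usual places to look when the input has been modified.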

### Now we will fail the test because our solution data is incorrect. Our first incorrect solution returns the wrong data type.

We are going to deliberately write some code that returns an incorrect solution with an incorrect data type.
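
As a quick illustration of why the data type matters, the sketch below compares a list with a set built from the same names. The variable names and values are invented for this example.

```python
names_list = ['Thorny', 'Mac', 'Farva']
names_set = set(names_list)

# Same names, but a list never compares equal to a set
same_object_kind = (type(names_list) == type(names_set))
equal_as_is = (names_list == names_set)
print(same_object_kind, equal_as_is)                 # False False

# The contents agree once both are put in the same form
print(sorted(names_set) == sorted(names_list))       # True
```

A set also discards the "top to bottom" order of the table, so even converting it back to a list may not reproduce the expected output.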

In [None]:
# reset the grades variable back to its original value
grades = [
    # First line is descriptive header. Subsequent lines hold data
    ['Student', 'Exam 1', 'Exam 2', 'Exam 3'],
    ['Thorny', '100', '90', '80'],
    ['Mac', '88', '99', '111'],
    ['Farva', '45', '56', '67'],
    ['Rabbit', '59', '61', '67'],
    ['Ursula', '73', '79', '83'],
    ['Foster', '89', '97', '101']
]

In [None]:
def get_students_incorrect_data_type(grades):
    ###
    ### YOUR CODE HERE
    ###

    # Sample solution code below
    students = []

    for i in grades:
        if i[0] != 'Student':   #correct code
            students.append(i[0])

#     change the data type
#     uncomment to show the error
#     return set(students)

    return students

In [None]:
# demo cell
students = get_students_incorrect_data_type(grades)
students

The test cell below will check your solution against several randomly generated test cases. If your solution does not pass the test (or if you're just curious), you can look at the variables used in the latest test run. They are automatically imported for you as part of the test.

- `input_vars` - Dictionary containing all of the inputs to your function. Keys are the parameter names.
- `original_input_vars` - Dictionary containing a copy of all the inputs to your function. This is useful for debugging failures related to your solution modifying the input. Keys are the parameter names.
- `returned_output_vars` - Dictionary containing the outputs your function generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.
- `true_output_vars` - Dictionary containing the outputs your function **should have** generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.

In [None]:
# `students_test`: Test cell
import nb_1_2_tester
tester = nb_1_2_tester.Tester_1_2_0()
for _ in range(20):
    try:
        tester.run_test(get_students_incorrect_data_type)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
print('Passed. Please submit!')

#### So we have failed the exercise, and we can see from the AssertionError that our returned output does not match the expected (solution) output. 

What do we do now?

#### Now we have to figure out which of the three tests we failed (data type, length, actual data values).

#### We can do a visual inspection, if we think that we can find the differences this way.

Note that we are using `display()` instead of `print()`. Why do you think we are doing this?

In [None]:
# # ONLY UNCOMMENT IF YOU NEED IT!!!!
# # BEWARE THAT THIS COULD GENERATE VOLUMINOUS OUTPUT!!!!!

# uncomment for visual inspection
print('returned_output_vars')
display(returned_output_vars)
print('\ntrue_output_vars')
display(true_output_vars)

### We have not covered dictionaries in the bootcamp yet, but note that the `test case variables` are dictionaries.

1. Each key of the dictionary is the name of a returned variable. In this case, the key is `'students'`. If there are multiple returned variables, there will be a separate key for each.

2. Each value of the dictionary is the actual value of that variable.

In [None]:
# addressing the keys and values
print("Keys")
display(true_output_vars.keys())

print('\nValues')
display(true_output_vars['students'])

#### If we are able to find the differences by visual inspection, great. Now go back up the code and figure out what we need to change.

#### But what if we are not able to see the difference(s) by visual inspection? This is VERY COMMON.

So let's programmatically find out the differences. 

For whichever difference is the issue, you must then go back up to your code and work through where you have introduced the error.

In [None]:
# First test: do we have the same data types?
for key, v_t in true_output_vars.items():
    # look up the returned output stored under the same key
    v_r = returned_output_vars.get(key)

    # check for datatype (list, dict, set)
    if type(v_t) == type(v_r):
        print('Output data types match\n')
    else:
        print('Output data types do not match')
        print('true_output_vars data type: ', type(v_t))
        print('returned_output_vars data type: ', type(v_r), '\n')

***

### Our next incorrect solution fails because the length of our returned output variable is different from the length of the true output variable.

We are going to deliberately write some code that returns an incorrect solution with an incorrect length.

In [None]:
def get_students_incorrect_length(grades):
    ###
    ### YOUR CODE HERE
    ###

    # Sample solution code below
    students = []

    for i in grades:
        if i[0] != 'Student':   #correct code
            students.append(i[0])
            
#         changes the length of the returned data
#         incorrect code, uncomment the next two lines to walk through
#         elif i[0] == 'Student':   
#             students.append(i[0])

    return students

In [None]:
# demo cell
students = get_students_incorrect_length(grades)
students

The test cell below will check your solution against several randomly generated test cases. If your solution does not pass the test (or if you're just curious), you can look at the variables used in the latest test run. They are automatically imported for you as part of the test.

- `input_vars` - Dictionary containing all of the inputs to your function. Keys are the parameter names.
- `original_input_vars` - Dictionary containing a copy of all the inputs to your function. This is useful for debugging failures related to your solution modifying the input. Keys are the parameter names.
- `returned_output_vars` - Dictionary containing the outputs your function generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.
- `true_output_vars` - Dictionary containing the outputs your function **should have** generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.

All of the test cells in this notebook will use the same format, and you can expect a similar format on your exams as well.

In [None]:
# `students_test`: Test cell
import nb_1_2_tester
tester = nb_1_2_tester.Tester_1_2_0()
for _ in range(20):
    try:
        tester.run_test(get_students_incorrect_length)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
print('Passed. Please submit!')

In [None]:
# Second test: do we have the same lengths?
for key, v_t in true_output_vars.items():
    # look up the returned output stored under the same key
    v_r = returned_output_vars[key]

    # check for the length of the solution (list, dict, set)
    if len(v_t) == len(v_r):
        print('Output lengths match\n')
    elif len(v_t) > len(v_r):
        print('true_output_vars is longer than returned_output_vars.')
        print('Your solution does not have enough data in it.')
        print('true_output_vars length:', len(v_t))
        print('returned_output_vars length:', len(v_r))
    else:
        print('returned_output_vars is longer than true_output_vars.')
        print('Your solution has too much data in it.')
        print('true_output_vars length:', len(v_t))
        print('returned_output_vars length:', len(v_r))

#### When there is a length difference between your output variables, the most likely reason is that you failed to correctly deal with one of the exercise requirements. You have done one of the following:

1. Included a value that should have been excluded (your output is longer than it should be).

2. Excluded a value that should have been included (your output is shorter than it should be).

#### At this point, our method is to output/display the two variables and visually compare them to find the missing/extra value.


#### Once we know what the extra/missing value is, we must go back to our code and compare it with each of the include/exclude requirements, to see which requirement we did not do correctly.

In this case, we included the header value `'Student'` in our output. This value is not the name of an actual student, so it should not have been included, and our output is longer than it should have been.

We can go back to our code, find where we have included this, and remove that code.
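
When the output is too long to compare by eye, a short comparison can supplement the visual check. The sketch below is one possible approach for list outputs, using invented sample values in place of the real test variables. It does not catch duplicated values or ordering problems.

```python
# Invented stand-ins for true_output_vars['students'] and
# returned_output_vars['students']
true_students = ['Thorny', 'Mac', 'Farva']
returned_students = ['Student', 'Thorny', 'Mac', 'Farva']

# Values we returned that should not be there
extra = [name for name in returned_students if name not in true_students]
# Values we should have returned but did not
missing = [name for name in true_students if name not in returned_students]

print('extra:', extra)       # extra: ['Student']
print('missing:', missing)   # missing: []
```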
***

### Our final incorrect solution fails because the actual data in our returned output variable is different from the data of the true output variable.

We are going to deliberately write some code that returns an incorrect solution with incorrect data.

### So what if the first two checks pass? This generally means that your code is computing something incorrectly. 
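
The function in the next code cell works on `copy.deepcopy(grades)` rather than on `grades` itself. The sketch below, with an invented two-row table, suggests why: a shallow copy of a nested list still shares its inner rows with the original, so changing a row through the copy also changes the input.

```python
import copy

table = [['Student', 'Exam 1'], ['Ann', '90']]

shallow = list(table)          # new outer list, same inner row objects
deep = copy.deepcopy(table)    # new outer list and new inner rows

shallow[1][0] += 'aaa'         # the shared row changes, so table changes too
after_shallow = table[1][0]
print(after_shallow)           # Annaaa

deep[1][0] += 'zzz'            # only the deep copy changes
after_deep = table[1][0]
print(after_deep)              # Annaaa
```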

In [None]:
# reset the grades variable back to its original value
grades = [
    # First line is descriptive header. Subsequent lines hold data
    ['Student', 'Exam 1', 'Exam 2', 'Exam 3'],
    ['Thorny', '100', '90', '80'],
    ['Mac', '88', '99', '111'],
    ['Farva', '45', '56', '67'],
    ['Rabbit', '59', '61', '67'],
    ['Ursula', '73', '79', '83'],
    ['Foster', '89', '97', '101']
]

In [None]:
def get_students_incorrect_data(grades):
    ###
    ### YOUR CODE HERE
    ###

    import copy
#     https://www.geeksforgeeks.org/copy-python-deep-copy-shallow-copy/
    grades_copy = copy.deepcopy(grades)   #why do this? this is a deep copy

    # Sample solution code below
    students = []

    for i in grades_copy:
        if i[0] != 'Student':   #correct code
            
#             change the data so that it is different from the solution
#             incorrect code, uncomment the next line to walk through
#             i[0] += "aaa"
            
            students.append(i[0])

    return students

In [None]:
# demo cell
students = get_students_incorrect_data(grades)
students

The test cell below will check your solution against several randomly generated test cases. If your solution does not pass the test (or if you're just curious), you can look at the variables used in the latest test run. They are automatically imported for you as part of the test.

- `input_vars` - Dictionary containing all of the inputs to your function. Keys are the parameter names.
- `original_input_vars` - Dictionary containing a copy of all the inputs to your function. This is useful for debugging failures related to your solution modifying the input. Keys are the parameter names.
- `returned_output_vars` - Dictionary containing the outputs your function generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.
- `true_output_vars` - Dictionary containing the outputs your function **should have** generated. If there are multiple outputs, the keys will match the names mentioned in the exercise instructions.

All of the test cells in this notebook will use the same format, and you can expect a similar format on your exams as well.

In [None]:
# `students_test`: Test cell
import nb_1_2_tester
tester = nb_1_2_tester.Tester_1_2_0()
for _ in range(20):
    try:
        tester.run_test(get_students_incorrect_data)
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
    except:
        (input_vars, original_input_vars, returned_output_vars, true_output_vars) = tester.get_test_vars()
        raise
print('Passed. Please submit!')

### So how do we deal with this third test scenario? We have several options, but none of them are "easy fixes".

### Ultimately, you are going to have to go back to your code and find/correct what you have written wrong.

1. Visually inspect the `test case variables` and find the difference(s). Then go back to the line(s) of code that produced the difference(s).

2. Go back into your code directly and walk through each step, comparing it to the requirement/step that it executes, to see if you can find the error.

    ***Some examples include:*** 


    1. You have written a math equation incorrectly.

    2. You are incorrectly assigning a value.

    3. You have some string manipulation wrong.

    4. You have a logic error.

    5. You have sorted incorrectly (or failed to sort when you should have).

    6. You have incorrectly rounded numeric data (or failed to round when you should have).
    
3. Write a code loop that compares each element individually and outputs an error message when they are not the same. 

    To do this, you must write a loop for the data type of your returned variable. 
    
    This is a (potentially) complex and time-consuming operation, and it requires a solid understanding of the data types, so we are not going to show this technique.
    
    With unlimited exam time, we might teach this methodology, but for students with limited programming backgrounds, at this point in the course, we believe that this is not a good use of time and resources.

In [None]:
# uncomment for visual inspection
print('returned_output_vars\n')
print(returned_output_vars)
print('\ntrue_output_vars\n')
print(true_output_vars)

In [None]:
# Another means of printing out the results
for i in returned_output_vars:
    print(returned_output_vars.get(i))
    print(true_output_vars.get(i))

***

## One final note, concerning the demo cells.

### The demo cells are designed to give you some sample data, to help you to get your code up and running. You will see these for every exercise on all of the exams.

## Note that you can pass the demo cell and still fail the test cell!!!

This will be ***ONE OF THE BIGGEST PROBLEMS*** for students on the exams.

Students think that, because they passed the DEMO CELL, it means that they WILL ALSO pass the TEST CELL.

This IS NOT the case, as the DEMO CELL is designed to give you some SAMPLE DATA, to help you get your code up and running.

## The DEMO CELL IS NOT a full test of your code, and you can easily pass the DEMO CELL and FAIL the TEST CELL!!! 

## The TEST CELL is a FULL TEST of your code, and it is much more extensive than the DEMO CELL.

## So be aware that you can pass the DEMO CELL and still fail the TEST CELL.
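
As a hypothetical illustration, the flawed solution below returns the right answer for a table with exactly six students, like the demo data, but silently drops students from a larger table. The function name and the generated tables are invented for this sketch.

```python
def get_students_fragile(grades):
    # Hypothetical flawed solution: assumes exactly six student rows
    return [row[0] for row in grades[1:7]]

header = [['Student', 'Exam 1']]
demo_table = header + [[f'S{i}', '90'] for i in range(6)]    # six students
larger_table = header + [[f'S{i}', '90'] for i in range(8)]  # eight students

demo_count = len(get_students_fragile(demo_table))
larger_count = len(get_students_fragile(larger_table))

print(demo_count)     # 6 -> all six students, looks correct
print(larger_count)   # 6 -> two students are missing, a test would fail
```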

***
***ON EVERY EXAM***, there will be ***AT LEAST 20 Piazza posts*** from students whose code passes the demo cell and fails the test cell. They will post that there is a BUG in the exam because of this, when in fact, their code is incorrect.

Please be aware of this difference between the demo and test cells.
***

## What are your questions concerning data troubleshooting using the `test case variables`?