# Troubleshooting pandas dataframes for the exams

In [None]:
# !wget https://raw.githubusercontent.com/gt-cse-6040/skills_oh_week_07/main/who3.csv
# !wget https://raw.githubusercontent.com/gt-cse-6040/skills_oh_week_07/main/who3_soln.csv

In [None]:
import pandas as pd

In [None]:
# bring in the two dataframes to work with
who3_soln = pd.read_csv('who3_soln.csv')
who3 = pd.read_csv('who3.csv')

## What we want to do is show some techniques for troubleshooting when our dataframe does not match the solution dataframe. 

## The exam test cells use the pandas function, `assert_frame_equal()` to test for correctness.

### This notebook will introduce the function, how it operates, and its usage on the exams.

### Then we will look at some troubleshooting techniques that students can use, for both the homework notebooks and the exams.

### Steps for today:

1. We have the who3 and who3_soln dataframes, from Notebook 7.

2.  We will make a copy of the who3 dataframe and manipulate it a little bit.

3.  Also run the function to show that they are equivalent, at the start.

#### Let's make a copy of our code output df to work with.

In [None]:
who3_bootcamp = who3.copy()

## How does the exam testing operate?

## The test case variables generate 4 pandas dataframes to test, to evaluate your function.

### There are 2 tests, one of which uses the assert_frame_equal() function to determine if your solution is correct. The other test MAY use this function, if the input is also a dataframe.

### The first test checks the variables that are input to your function, and compares them to the same variables after your function has been executed. The test determines if you have altered the input variables in any way. If this test fails, then the only thing that you can do is go back into your code and see where you have altered the input variable. Some ways might be:

1. Worked directly on the input variable, instead of making a copy.

2. Made a copy of the variable, when you needed to make a deep copy, because the variable contains nested data.

3. Added or removed elements from the input variable in your code, as you are executing the function requirements.
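
As a minimal sketch of how you could check this yourself before running the test cell, the code below takes a snapshot of the input, calls a function, and then compares the input to the snapshot. The two functions and the toy dataframe are hypothetical, made up only for illustration; the exam test code itself may do this check differently.

```python
import pandas as pd

def add_total_in_place(df):
    # Hypothetical function that works directly on its input (mistake 1 above)
    df["total"] = df["a"] + df["b"]
    return df

def add_total_on_copy(df):
    # Same logic, but applied to a copy so the caller's dataframe is untouched
    out = df.copy()
    out["total"] = out["a"] + out["b"]
    return out

original = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
snapshot = original.copy()   # what the input looked like before any call

add_total_on_copy(original)
unchanged_after_copy_version = original.equals(snapshot)

add_total_in_place(original)
unchanged_after_in_place_version = original.equals(snapshot)

print(unchanged_after_copy_version)      # True
print(unchanged_after_in_place_version)  # False
```

If the second value is `False` for your own function, something in your code is modifying the input.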

#### Documentation for copy and deep copy:  https://docs.python.org/3/library/copy.html

In [None]:
#deep copy versus copy syntax
from copy import copy
from copy import deepcopy

who3_copy = copy(who3)

who3_deepcopy = deepcopy(who3)

display(who3_copy)
display(who3_deepcopy)

***

#### While we are not going to go into detail here, if your data is nested in any way, only `deepcopy` will make a full copy of the data. 

### Bottom line, you will never be wrong if you use `deepcopy()` whenever you need to make a copy of your input data.
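
To illustrate what "nested" means here, this small sketch uses a plain Python dictionary that holds a list (toy data, not from the exam). It is only meant to show the difference in behavior between `copy` and `deepcopy` on nested Python containers.

```python
from copy import copy, deepcopy

nested = {"scores": [1, 2, 3]}   # a dict holding a list: nested data

shallow = copy(nested)      # new outer dict, but the inner list is shared
deep = deepcopy(nested)     # new outer dict and a new inner list

# Changing the inner list through the shallow copy also changes the original,
shallow["scores"].append(99)

# while the deep copy kept its own independent inner list.
print(nested["scores"])  # [1, 2, 3, 99]
print(deep["scores"])    # [1, 2, 3]
```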

***

### The second test checks the output of your function (a pandas dataframe in this case) against the pandas dataframe that is the solution. The test uses the assert_frame_equal() function to do this, after some dataframe manipulation.

#### Here is the function documentation:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.testing.assert_frame_equal.html

#### Let's take a look at the function, in the documentation.

### What are the checks that assert_frame_equal() does?
 
1. Checks for data types in the two dfs -- `check_dtype, check_index_type, and check_column_type parameters`
2. Checks the number of rows and columns in the two dfs. `No direct parameter for this.`
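3. Checks the column names in the two dfs. `The column labels are always compared; the check_names parameter only controls the names attribute of the index and columns.`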
4. Finally, compares the actual data in the two dfs.  `No direct parameter for this.`
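
As a rough sketch of how one of these parameters changes the result, the code below compares two toy dataframes that hold the same values with different data types. The helper `frames_match` is hypothetical, written only for this illustration; the exam test cells may call the function with different settings.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

left = pd.DataFrame({"count": [1, 2, 3]})          # int64 column
right = pd.DataFrame({"count": [1.0, 2.0, 3.0]})   # float64 column

def frames_match(a, b, **kwargs):
    # Hypothetical helper: True if assert_frame_equal raises nothing
    try:
        assert_frame_equal(a, b, **kwargs)
        return True
    except AssertionError:
        return False

strict = frames_match(left, right)                      # default check_dtype=True
relaxed = frames_match(left, right, check_dtype=False)  # ignore the dtype difference

print(strict, relaxed)   # False True
```

Relaxing a check like this is useful for narrowing down what is wrong, but your final output should still pass with the default settings.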

#### Let's look at each one individually, and how to troubleshoot.

## You can use this Pandas function to do your testing and troubleshooting.

## To run your tests, all you need to do is import the function and then call it, passing in the return_output_variables and true_output variables.

### We will now set up 4 dataframe examples that exercise the 4 types of tests that assert_frame_equal() does. The changed dataframes will all be named for the type of change/test we are exercising.

#### What we will do is change each df and run some code that shows you how to do individual checks for this test.

#### Then we will show how you can do this using the assert_frame_equal() function, and let it tell you what has failed.

In [None]:
# import the function, students will need to do this on the exams
from pandas.testing import assert_frame_equal

### For our first example, we will change a datatype. 

https://www.geeksforgeeks.org/get-the-datatypes-of-columns-of-a-pandas-dataframe/

In [None]:
who3_bootcamp_type = who3_bootcamp.copy()
who3_bootcamp_type["count"] = who3_bootcamp_type["count"].astype(float)  

In [None]:
# a,b = who3_bootcamp_type,who3_soln   # data type differences of columns
# result = assert_frame_equal(a,b)
# result

### So we have failed the exercise, and we can see from the AssertionError that our returned output does not match the expected (solution) output. 

### What do we do now?

The **AssertionError** message from the function is very good, and we can see that the datatypes for the column "count" are different, and what they are.

Note that the "left" dataframe is always the first one that you passed in as a parameter, and the "right" dataframe is the second parameter.
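
If you want to keep working in the same cell after a failure, one option is to catch the error and print its message. This is a minimal sketch with toy dataframes; the variable names `my_output` and `solution` are placeholders for your returned output and the true output.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

my_output = pd.DataFrame({"count": [1.0, 2.0]})   # float64, passed first  -> "left"
solution = pd.DataFrame({"count": [1, 2]})        # int64,   passed second -> "right"

try:
    assert_frame_equal(my_output, solution)
    message = ""                 # no error means the dataframes match
except AssertionError as err:
    message = str(err)           # keep the message so we can read it

print(message)
```

The printed message labels the `float64` dtype as `[left]` and the `int64` dtype as `[right]`, matching the order of the arguments.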

### So we would go back to our code, find the line(s) in which we are assigning the incorrect data type, and change them.

In [None]:
# display the column data types
# display(who3_bootcamp_type.dtypes)
# display(who3_soln.dtypes)
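
With many columns, reading two `dtypes` listings side by side gets tedious. As one possible approach, the sketch below puts both listings into a single table and keeps only the rows that differ. The dataframes are toy data, and it assumes both have the same column names.

```python
import pandas as pd

my_output = pd.DataFrame({"year": [2000, 2001], "count": [1.0, 2.0]})
solution = pd.DataFrame({"year": [2000, 2001], "count": [1, 2]})

# one row per column name, one column per dataframe, dtypes shown as strings
dtype_table = pd.DataFrame({
    "mine": my_output.dtypes.astype(str),
    "solution": solution.dtypes.astype(str),
})

# keep only the columns whose dtypes disagree
mismatched = dtype_table[dtype_table["mine"] != dtype_table["solution"]]
print(mismatched)
```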

### For our second example, we change the number of rows in the dataframe.
https://www.geeksforgeeks.org/count-the-number-of-rows-and-columns-of-pandas-dataframe/

In [None]:
who3_bootcamp_drop = who3_bootcamp.copy()
# Dropping last 10 rows using drop
who3_bootcamp_drop.drop(who3_bootcamp_drop.tail(10).index,inplace = True)

#note that printing the head() of the df does not tell you there is a difference
print(who3_bootcamp_drop.head())
print(who3_soln.head())

In [None]:
# # differences in the number of rows of data
# a,b = who3_bootcamp_drop,who3_soln   
# result = assert_frame_equal(a,b)
# result

### So we have failed the exercise, and we can see from the AssertionError that our returned output does not match the expected (solution) output. 

### What do we do now?

Again, the **AssertionError** message from the function is very good, and we can see that the dataframes have different shapes, in their number of rows and columns.

### When there is a length difference in your output variables, the most likely reason is that you failed to correctly deal with one of the exercise requirements. You have done one of the following:

   **If the number of rows is different:**

    1. Included a value that should have been excluded (your output is longer than what it should be).

    2. Excluded a value that should have been included (your output is shorter than what it should be).
    
   **If the number of columns is different:**
    
    1. Included a column that should have been excluded (your output is wider than what it should be).

    2. Excluded a column that should have been included (your output is narrower than what it should be).

### At this point, we could output/display the two variables and visually compare them, to see if we can find the missing/extra value.

### But the `better method` is to go back to the requirements, your strategy, and your code implementing the strategy, to see if you are handling the edge cases correctly.

   **For rows:**

    1. If you are using `nlargest()` or `nsmallest()`, are you accounting for the ties correctly?
    
    2. If you are doing string manipulation, are you accounting for all of the inclusive/exclusive requirements for elements with the string values that you are comparing for?
    
    3. If you are doing summarizations, have you included all of the required ones (min/max/mean/median), or for time series data, do you have the correct number of periods?
    
   **For columns:**
    
    1. Compare the required columns to your columns and add/remove as necessary.
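
As an example of the first row check above, here is a small sketch of how ties can change the number of rows that `nlargest()` returns. The data is made up, and whether ties should be kept depends on the exercise requirements.

```python
import pandas as pd

scores = pd.DataFrame({"name": ["a", "b", "c", "d"], "score": [10, 8, 8, 5]})

# default keep="first": exactly 2 rows, so one of the tied rows is dropped
top2_default = scores.nlargest(2, "score")

# keep="all": every row tied for the last place is kept, so we get 3 rows
top2_with_ties = scores.nlargest(2, "score", keep="all")

print(len(top2_default), len(top2_with_ties))   # 2 3
```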

### Once we know what the extra/missing value/column is, we must go back to our code and compare it with each of the include/exclude requirements, to see which requirement we did not do correctly.

In [None]:
# # # fetching the number of rows and columns
# rows_OH = who3_bootcamp_drop.shape[0]
# cols_OH = who3_bootcamp_drop.shape[1]

# rows_soln = who3_soln.shape[0]
# cols_soln = who3_soln.shape[1]
  
# # displaying the number of rows and columns
# print("Rows OH: " + str(rows_OH))
# print("Columns OH: " + str(cols_OH))
# print("Rows soln: " + str(rows_soln))
# print("Columns soln: " + str(cols_soln))
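
The shapes tell you that rows are missing or extra, but not which ones. One way to find them, sketched below on toy data, is an outer merge with `indicator=True`. This assumes both dataframes have the same columns and no duplicated rows; with duplicates the result can be harder to read.

```python
import pandas as pd

my_output = pd.DataFrame({"country": ["AD", "AE"], "count": [1, 2]})
solution = pd.DataFrame({"country": ["AD", "AE", "AF"], "count": [1, 2, 3]})

print(my_output.shape, solution.shape)   # (2, 2) (3, 2)

# merge on all shared columns; the _merge column says where each row was found
merged = my_output.merge(solution, how="outer", indicator=True)

# rows found in only one of the two dataframes
only_in_one = merged[merged["_merge"] != "both"]
print(only_in_one)
```

A value of `right_only` means the row is in the solution but missing from your output, and `left_only` means your output has a row the solution does not.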

### For our third example, we will change a couple of column names.

In [None]:
# # Change a couple of the column names in the new df, just to generate the error.
who3_bootcamp_colname = who3_bootcamp.copy()
who3_bootcamp_colname.rename(columns = {'country':'Country'}, inplace = True)
who3_bootcamp_colname.rename(columns = {'age_group':'age-group'}, inplace = True)

In [None]:
# a,b = who3_bootcamp_colname,who3_soln   # column names different
# result = assert_frame_equal(a,b)
# result

### So we have failed the exercise, and we can see from the AssertionError that our returned output does not match the expected (solution) output. 

### What do we do now?

Again, the **AssertionError** message from the function is very good, and we can see that the column names do not match.

#### The error message output is very good, in that it lists the column names, one above the other, for a direct visual comparison. Find the difference visually and then go back to the code and change it appropriately.

#### However, if we have a very wide data frame, with many columns, the displayed output will not directly line up, so we provide the code below to loop over the column names and directly output the differences.

#### Here is some template code to check the column names directly, you can use it if you would like.

In [None]:
# check column names directly
my_list = who3_bootcamp_colname.columns.tolist()   # change to your returned output variable's name
soln_list = who3_soln.columns.tolist()             # change to the true output variable's name

# a different number of columns is itself a mismatch; zip() below stops at the shorter list
if len(my_list) != len(soln_list):
    print('Column counts differ: ', len(my_list), 'vs', len(soln_list))

for col_name, soln_name in zip(my_list, soln_list):
    if col_name != soln_name:
        print('Column names do not match')
        print('My column name: ', col_name)
        print('Soln column nm: ', soln_name)
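
If the position of the columns does not matter for your check, a set comparison is another option. This sketch uses two empty toy dataframes with made-up column lists, just to show the names that appear on only one side.

```python
import pandas as pd

my_output = pd.DataFrame(columns=["Country", "year", "age-group", "count"])
solution = pd.DataFrame(columns=["country", "year", "age_group", "count"])

# names that appear in one dataframe but not in the other
only_in_mine = sorted(set(my_output.columns) - set(solution.columns))
only_in_solution = sorted(set(solution.columns) - set(my_output.columns))

print("Only in my output:", only_in_mine)       # ['Country', 'age-group']
print("Only in solution: ", only_in_solution)   # ['age_group', 'country']
```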

#### Finally, let's change some values in the df.

In [None]:
# # let's change a few values in our bootcamp dataframe
who3_bootcamp_data = who3_bootcamp.copy()
# # make some data changes
who3_bootcamp_data.at[0, 'count'] = 6
who3_bootcamp_data.at[1, 'count'] = 2
who3_bootcamp_data.at[2, 'year'] = 2011
who3_bootcamp_data.at[3, 'year'] = 2012

In [None]:
# a,b = who3_bootcamp_data,who3_soln   # data differences
# result = assert_frame_equal(a,b)
# result

# Notice that we changed "count" and "year" above, but the error only returns for "count".
# The function returns the first column with errors, and you may need to run this multiple
# times, if you have multiple columns with data errors.

### So we have failed the exercise, and we can see from the AssertionError that our returned output does not match the expected (solution) output. 

### What do we do now?

This time, the **AssertionError** message from the function is not quite as direct as before, but we can see that some values in the column `count` are different.

### But wait, we changed the values in two of the columns, and the error message is only giving one column as having incorrect values? Is the test wrong?

#### No, the test is functioning correctly. The `assert_frame_equal()` function stops at the first column that has a difference in values and reports the error for that column only. If this is the error, you may end up fixing the first column and then finding out that one or more additional columns are also incorrect.
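
To see every column with a problem in one pass, you could loop over the columns yourself. A minimal sketch on toy data follows; it assumes both dataframes have the same column names and the same number of rows.

```python
import pandas as pd

my_output = pd.DataFrame({"year": [2011, 2001], "count": [6, 2]})
solution = pd.DataFrame({"year": [2000, 2001], "count": [1, 2]})

# Series.equals is True only when values (and dtype) match for the whole column
differing_columns = [
    col for col in solution.columns
    if not my_output[col].equals(solution[col])
]

print(differing_columns)   # ['year', 'count']
```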

### What we recommend is running the pandas function `compare()`, which shows every row and column where the values differ, not just the first column.

Let's look at the documentation:  https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.compare.html

In [None]:
# # note that the "self" df is the df that we are running compare on
# # and the "other" df is the one in the parenthesis (we are comparing to)
display(who3_soln.compare(who3_bootcamp_data))

# # align the differences on rows, just a different way of looking at the comparison
display(who3_soln.compare(who3_bootcamp_data,align_axis=0))

### Now that you know the differences, you are going to have to go back to your code and find/correct what you have written wrong.

1. Go back to the line(s) of code that produced the difference(s).

2. Go back into your code directly and walk through each step, comparing it to the requirement/step that it executes, to see if you can find the error. 

    ***Some examples include:*** 


    1. You have written a math code equation wrong.

    2. You are incorrectly assigning a value.

    3. You have some string manipulation wrong.

    4. You have a logic error.

    5. You have sorted incorrectly (or failed to sort when you should have).

    6. You have incorrectly rounded numeric data (or failed to round when you should have).
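
For the sorting case in particular, one quick check is to sort both dataframes the same way and compare again. If they match after sorting, then only the row order is wrong. This is a sketch on toy data, and `normalized` is a hypothetical helper written for this illustration.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

solution = pd.DataFrame({"country": ["AD", "AE", "AF"], "count": [1, 2, 3]})
my_output = solution.iloc[::-1].reset_index(drop=True)   # same rows, reversed order

def normalized(df):
    # Hypothetical helper: sort by every column, then rebuild the index
    return df.sort_values(list(df.columns)).reset_index(drop=True)

try:
    assert_frame_equal(normalized(my_output), normalized(solution))
    # equal after sorting but not before: the only problem is the row order
    only_order_differs = not my_output.equals(solution)
except AssertionError:
    only_order_differs = False

print(only_order_differs)   # True
```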
    

## One final note, concerning the demo cells.

### The demo cells are designed to give you some sample data, to help you to get your code up and running. You will see these for every exercise on all of the exams.

## Note that I can pass the demo cell and still fail the test cell!!!

This will be ***ONE OF THE BIGGEST PROBLEMS*** for students on the exams.

Students think that, because they passed the DEMO CELL, it means that they WILL ALSO pass the TEST CELL.

This IS NOT the case, as the DEMO CELL is designed to give you some SAMPLE DATA, to help you get your code up and running.

## The DEMO CELL IS NOT a full test of your code, and you can easily pass the DEMO CELL and FAIL the TEST CELL!!! 

## The TEST CELL is a FULL TEST of your code, and it is much more extensive than the DEMO CELL.

## So be aware that you can pass the DEMO CELL and still fail the TEST CELL.

***
***ON EVERY EXAM***, there will be ***AT LEAST 20 Piazza posts*** from students whose code passes the demo cell and fails the test cell. They will post that there is a BUG in the exam because of this, when in fact, their code is incorrect.

Please be aware of this difference between the demo and test cells.
***

## What are your questions concerning data troubleshooting of pandas dataframes?