_Main topics covered during today's session:_

Previous NB:

1. **Pandas Functions:**
    
    a. Index operations
    
    b. Concat
    
    c. Merge
    
    d. Groupby and Aggregation

This NB:

2. **Troubleshooting pandas dataframes**

*****************************************

# Troubleshooting pandas dataframes

*****************************************

## What we want to do is show some techniques for troubleshooting when our dataframe does not match the solution dataframe. 

### The pandas dataframe testing in the homework and exam notebooks is typically done with the tibbles_are_equivalent() function, but it can also be done by other means.

*All of the troubleshooting below applies for troubleshooting dataframes, no matter the test method.*


### We already introduced a notebook for data troubleshooting back in module 1. While our coverage of that notebook focused on the standard Python data objects, it also includes code for troubleshooting pandas dataframes, so we will reintroduce that notebook, with emphasis on the pandas troubleshooting code.

*Steps for today:*

1. We have the who3 and who3_soln dataframes, from Notebook 7.

2.  We will make a copy of the who3 dataframe and manipulate it a little bit.

3.  Before changing anything, we will run the equivalence functions to show that the two dataframes start out equivalent.

#### First, let's import the pandas library and define the functions.

The two functions below, canonicalize_tibble() and tibbles_are_equivalent(), are from the Notebook 7 solution.

In [None]:
import pandas as pd

### Remember that `Y` is in canonical order if it has the following properties.

1. The variables appear in sorted order by name, ascending from left to right.
2. The rows appear in lexicographically sorted order by variable, ascending from top to bottom.
3. The row labels (`Y.index`) go from 0 to `n-1`, where `n` is the number of observations.

In [None]:
def canonicalize_tibble(X):
    # Enforce Property 1:
    var_names = sorted(X.columns)
    Y = X[var_names].copy()

    ### BEGIN SOLUTION
    # Enforce Property 2:
    Y.sort_values(by=var_names, inplace=True)
    
    # Enforce Property 3:
    Y.reset_index(drop=True, inplace=True)
#     Y.sort_index(inplace=True)
    ### END SOLUTION
    return Y
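
To make the three properties concrete, here is a minimal sketch on a small made-up dataframe. The `toy` data and the compact `canonicalize()` helper are illustrative stand-ins and are not part of Notebook 7; the helper is meant to do the same three steps as the function above.

```python
import pandas as pd

def canonicalize(X):
    # Illustrative stand-in for canonicalize_tibble():
    # sort columns by name, sort rows by all columns, reset the index.
    cols = sorted(X.columns)
    return X[cols].sort_values(by=cols).reset_index(drop=True)

# Made-up data: columns and rows are deliberately out of order.
toy = pd.DataFrame({"year": [2011, 2010, 2010],
                    "country": ["B", "B", "A"]})

canon = canonicalize(toy)
print(canon)
```

After canonicalizing, `country` comes before `year`, the rows are sorted by `country` and then `year`, and the row labels run from 0 to 2.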

#### Below are two implementations of the tibbles_are_equivalent() function.

#### The two functions differ slightly in how they evaluate the dataframes, but in most cases they reach the same True/False conclusion.

1. If the dataframes are equivalent, they will both return True.


2. If the dataframes are NOT equivalent, the first function will either error out or return a false value.
    
    a. If the df's differ in column names or size (and, in some cases, data type), the function will error out, with a descriptive error message that you can troubleshoot.
    
    b. If the differences in the df's are purely because of data value differences, the function will return a False value.


3. If the dataframes are NOT equivalent, the second function will return a False value, no matter the reason.


4. The first function is the one that will be in notebooks and (potentially) exams. We present the second one as an alternative solution and troubleshooting tool for you. It enables you to run functions later in the notebook, past cells that you know are failing, without having to stop and restart.
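
The difference in behavior comes from the comparison each function uses: `==` between two dataframes requires identical row and column labels, while `DataFrame.equals()` simply returns False when they do not match. The sketch below shows this on two made-up dataframes, `a` and `b`, that differ in size; the names are for illustration only.

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2, 3]})
b = pd.DataFrame({"x": [1, 2]})   # one row short

# Approach used by the first function: elementwise == then .all().all()
try:
    (a == b).all().all()
    eq_result = "no error"
except ValueError as err:
    eq_result = f"ValueError: {err}"

# Approach used by the second function: DataFrame.equals()
equals_result = a.equals(b)

print(eq_result)       # the == comparison raises because the labels differ
print(equals_result)   # equals() just reports False
```

This is why the second version lets the rest of a notebook keep running past a failing comparison.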

In [None]:
def tibbles_are_equivalent(A, B):
    """Given two tidy tables ('tibbles'), returns True iff they are
    equivalent.
    """
    ### BEGIN SOLUTION
    A_hat = canonicalize_tibble(A)
    B_hat = canonicalize_tibble(B)
    equal = (A_hat == B_hat)
    return equal.all().all()
    ### END SOLUTION

    
def tibbles_are_equivalent_T(A, B):
    #Alternative solution
    A_copy = A.copy()
    B_copy = B.copy()
    
    
    A_canon = canonicalize_tibble(A_copy)
    B_canon = canonicalize_tibble(B_copy)
    
    return A_canon.equals(B_canon)
    
    

In [None]:
# bring in the two dataframes to work with
who3_soln = pd.read_csv('who3_soln.csv')
who3 = pd.read_csv('who3.csv')

#### Let's show that we have equivalent dataframes to start with.

In [None]:
who3_skillsOH = who3.copy()
print(who3_skillsOH.head())
print(tibbles_are_equivalent(who3_skillsOH, who3_soln))
print(tibbles_are_equivalent_T(who3_skillsOH, who3_soln))

### What are some of the common reasons that the tibbles_are_equivalent() function fails?

1. Data types are different (int32 vs int64, for example) -- Use the `dtypes` attribute of each dataframe
2. Number of rows of data is different -- Use the `shape` attribute of each dataframe
3. Column names are different (capital versus small letters, underscore versus dash, etc) -- Use the Pandas compare() function (it raises an error when the column labels differ), then print the column lists
4. Actual data is different -- Use the Pandas compare() function

#### Let's look at each one individually, and how to troubleshoot.

#### Remember that a False result DOES NOT tell you WHY the comparison failed, only that it failed.

#### First, let's change a datatype or two and see what happens
#### Starting with the who3_skillsOH dataframe again
https://www.geeksforgeeks.org/get-the-datatypes-of-columns-of-a-pandas-dataframe/

In [None]:
display(who3_skillsOH.dtypes)
display(who3_soln.dtypes)

In [None]:
who3_skillsOH_type = who3_skillsOH.copy()
# who3_skillsOH_type["count"] = who3_skillsOH_type["count"].astype(int)  #makes it int32 on my laptop
who3_skillsOH_type["count"] = who3_skillsOH_type["count"].astype("int64")

print(tibbles_are_equivalent(who3_skillsOH_type,who3_soln))
print(tibbles_are_equivalent_T(who3_skillsOH_type,who3_soln))

In [None]:
display(who3_skillsOH_type.dtypes)
display(who3_soln.dtypes)
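
With many columns, reading two dtype listings side by side gets tedious. As a sketch, the snippet below lines the two listings up in one table and keeps only the columns whose types differ. The `mine` and `soln` dataframes are made-up stand-ins for your dataframe and the solution dataframe.

```python
import pandas as pd

# Made-up data: same values, but "count" is int32 in one and int64 in the other.
mine = pd.DataFrame({"country": ["A", "B"],
                     "count": pd.Series([1, 2], dtype="int32")})
soln = pd.DataFrame({"country": ["A", "B"],
                     "count": pd.Series([1, 2], dtype="int64")})

# Put both dtype listings in one table, as strings, indexed by column name.
dtype_table = pd.DataFrame({"mine": mine.dtypes.astype(str),
                            "soln": soln.dtypes.astype(str)})

# Keep only the columns where the two types disagree.
mismatched = dtype_table[dtype_table["mine"] != dtype_table["soln"]]
print(mismatched)

# One possible fix: cast each column to the solution's dtype.
fixed = mine.astype(soln.dtypes.to_dict())
print(fixed.dtypes)
```

Note that `DataFrame.equals()` also compares dtypes, so a type mismatch alone is enough to make the troubleshooting function return False.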

#### Do we have the same number of data rows?
https://www.geeksforgeeks.org/count-the-number-of-rows-and-columns-of-pandas-dataframe/

In [None]:
who3_skillsOH_drop = who3_skillsOH.copy()
# Dropping last 10 rows using drop
who3_skillsOH_drop.drop(who3_skillsOH_drop.tail(10).index,inplace = True)

print(who3_skillsOH_drop.head())
print(who3_soln.head())

my_list = who3_skillsOH_drop.columns.values.tolist()
print(my_list)
soln_list = who3_soln.columns.values.tolist()
print(soln_list)



print(tibbles_are_equivalent_T(who3_skillsOH_drop, who3_soln))
# print(tibbles_are_equivalent(who3_skillsOH_drop, who3_soln))  #note that this errors out, as expected

In [None]:
# # fetching the number of rows and columns
rows_OH = who3_skillsOH_drop.shape[0]
cols_OH = who3_skillsOH_drop.shape[1]

rows_soln = who3_soln.shape[0]
cols_soln = who3_soln.shape[1]
  
# displaying the number of rows and columns
print("Rows OH: " + str(rows_OH))
print("Columns OH: " + str(cols_OH))
print("Rows soln: " + str(rows_soln))
print("Columns soln: " + str(cols_soln))
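
Knowing that the row counts differ is only the first step; it usually helps to see which rows are missing or extra. One way to do that, sketched below on made-up data, is an outer merge with `indicator=True`, which adds a `_merge` column saying where each row came from. The `mine` and `soln` names are illustrative stand-ins.

```python
import pandas as pd

# Made-up data: "mine" is missing the last row of "soln".
soln = pd.DataFrame({"country": ["A", "B", "C"], "count": [1, 2, 3]})
mine = soln.iloc[:-1].copy()

print(mine.shape, soln.shape)

# Outer merge on all shared columns; _merge marks each row as
# "both", "left_only" (only in soln) or "right_only" (only in mine).
diff = soln.merge(mine, how="outer", indicator=True)
unmatched = diff[diff["_merge"] != "both"]
print(unmatched)
```

Rows tagged `left_only` exist only in the solution, and rows tagged `right_only` exist only in your dataframe.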

#### Now let's look at different column names and data differences.
#### A good way to determine differences is to run the Pandas compare() function.
https://www.geeksforgeeks.org/how-to-compare-two-dataframes-with-pandas-compare/

#### Note that running compare() returns no results when the data frames are identical.

In [None]:
print("tibbles result")
print(tibbles_are_equivalent(who3_skillsOH, who3_soln))
print("\ncompare result")
who3_soln.compare(who3_skillsOH)  

In [None]:
# # Change a couple of the column names in the new df, just to generate the error.
who3_skillsOH_colname = who3_skillsOH.copy()
who3_skillsOH_colname.rename(columns = {'country':'Country'}, inplace = True)
who3_skillsOH_colname.rename(columns = {'age_group':'age-group'}, inplace = True)
print('Troubleshooting function')
print(tibbles_are_equivalent_T(who3_skillsOH_colname, who3_soln))
print('solution function')
# print(tibbles_are_equivalent(who3_skillsOH_colname, who3_soln))  # errors out, as expected

In [None]:
# # generate an error using compare().
# # because the column names are different
# who3_soln.compare(who3_skillsOH_colname)  #errors out, uncomment to run

In [None]:
my_list = who3_skillsOH_colname.columns.values.tolist()
print(my_list)
soln_list = who3_soln.columns.values.tolist()
print(soln_list)
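
Instead of comparing the two printed lists by eye, set differences can point straight at the column names that do not match. This is a minimal sketch using empty made-up dataframes with the column names from the example above; `mine` and `soln` are illustrative stand-ins.

```python
import pandas as pd

# Made-up, empty dataframes: only the column names matter here.
soln = pd.DataFrame(columns=["age_group", "country", "count", "year"])
mine = soln.rename(columns={"country": "Country", "age_group": "age-group"})

# Names that appear in one dataframe but not in the other.
only_mine = sorted(set(mine.columns) - set(soln.columns))
only_soln = sorted(set(soln.columns) - set(mine.columns))

print("Only in mine:", only_mine)
print("Only in soln:", only_soln)
```

Both lists are empty when the column names match exactly.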

#### Finally, let's see how to determine if there are value differences
#### Again, use the Pandas compare() function

In [None]:
# # let's change a few values in our skillsOH dataframe
# # Show that the df copy is the same
who3_skillsOH_data = who3_skillsOH.copy()
tibbles_are_equivalent(who3_skillsOH_data,who3_soln) #to show that they are the same to start with

In [None]:
# # make some data changes
who3_skillsOH_data.at[0, 'count'] = 6
who3_skillsOH_data.at[1, 'count'] = 2
who3_skillsOH_data.at[2, 'year'] = 2011
who3_skillsOH_data.at[3, 'year'] = 2012

In [None]:
print(tibbles_are_equivalent(who3_skillsOH_data,who3_soln))
print(tibbles_are_equivalent_T(who3_skillsOH_data,who3_soln))

In [None]:
who3_soln.compare(who3_skillsOH_data)
# # note that the "self" df is the df that we call compare() on
# # and the "other" df is the one in the parentheses (the df we are comparing to)

In [None]:
# # align the differences on rows, just a different way of looking at the comparison
who3_soln.compare(who3_skillsOH_data,align_axis=0)
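
Here is the same idea on a small made-up dataframe, so the shape of the compare() output is easy to see. The data is illustrative only. compare() matches rows by index label, so when row order may differ it can help to canonicalize both dataframes first.

```python
import pandas as pd

# Made-up data with two deliberate value changes.
soln = pd.DataFrame({"count": [10, 20, 30], "year": [2010, 2010, 2010]})
mine = soln.copy()
mine.at[0, "count"] = 6
mine.at[2, "year"] = 2011

# Default layout: one row per differing row, with self/other column pairs.
diff = soln.compare(mine)
print(diff)

# Alternative layout: self and other stacked as separate rows.
stacked = soln.compare(mine, align_axis=0)
print(stacked)
```

Only rows 0 and 2 appear in the result, and cells that match show up as NaN.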

### So how do we troubleshoot when tibbles_are_equivalent() returns False?
#### (Or when you are troubleshooting your dataframe against the solution dataframe)

#### 1.  Check the data types.
#### 2.  Check the number of rows and columns.
#### 3.  Run compare() for column name and actual data differences.
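
The checks can also be bundled into one helper. The sketch below is a hypothetical `diagnose()` function, not part of the course notebooks. It checks column names first, because the later checks assume both dataframes have the same columns, then the shape, the data types, and finally the values.

```python
import pandas as pd

def diagnose(mine, soln):
    """Hypothetical helper: return a list of findings (empty if none)."""
    findings = []

    # Column names: stop here if they differ.
    extra = sorted(set(mine.columns) - set(soln.columns))
    missing = sorted(set(soln.columns) - set(mine.columns))
    if extra or missing:
        findings.append(f"column names differ: extra {extra}, missing {missing}")
        return findings

    # Shape: stop here if the sizes differ.
    if mine.shape != soln.shape:
        findings.append(f"shape differs: {mine.shape} vs {soln.shape}")
        return findings

    # Data types, column by column.
    cols = sorted(soln.columns)
    for c in cols:
        if mine[c].dtype != soln[c].dtype:
            findings.append(f"dtype differs in {c}: {mine[c].dtype} vs {soln[c].dtype}")

    # Values: canonicalize both dataframes, then compare.
    a = mine[cols].sort_values(by=cols).reset_index(drop=True)
    b = soln[cols].sort_values(by=cols).reset_index(drop=True)
    diff = b.compare(a)
    if not diff.empty:
        findings.append(f"{len(diff)} row(s) differ in value")
    return findings

# Made-up data for a quick check.
soln = pd.DataFrame({"country": ["A", "B"], "count": [1, 2]})
mine = soln.copy()
mine.at[1, "count"] = 5
print(diagnose(mine, soln))
```

An empty list means none of these checks found a difference.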