Suggestion: add feature to show in detail changes in 1 df over time #30429

christina-zhou-96 · 2019-12-23T17:01:06Z

Code Sample, a copy-pastable example if possible

import pandas as pd
import numpy as np
from tabulate import tabulate

def print_table(df, doc=False):
    """
    Print out a nice looking table.
    
    Set doc to True if you are printing this out for a Microsoft Word (11 x 8) document.

    When doc is False:
    
    Input:
       A  B
    0  1  0
    1  2  0
    2  3  0

    Output:
    +----+-----+-----+
    |    |   A |   B |
    |----+-----+-----|
    |  0 |   1 |   0 |
    |  1 |   2 |   0 |
    |  2 |   3 |   0 |
    +----+-----+-----+
    
    When doc is True:
    Input:
         a   b   c   d   e   f   g   h   i   j   k   l   m   n   p
    0    0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
    1    1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
    2    2   2   2   2   2   2   2   2   2   2   2   2   2   2   2
    3    3   3   3   3   3   3   3   3   3   3   3   3   3   3   3
    4    4   4   4   4   4   4   4   4   4   4   4   4   4   4   4
    5    5   5   5   5   5   5   5   5   5   5   5   5   5   5   5
    6    6   6   6   6   6   6   6   6   6   6   6   6   6   6   6
    7    7   7   7   7   7   7   7   7   7   7   7   7   7   7   7
    8    8   8   8   8   8   8   8   8   8   8   8   8   8   8   8
    9    9   9   9   9   9   9   9   9   9   9   9   9   9   9   9
    10  10  10  10  10  10  10  10  10  10  10  10  10  10  10  10
    11  11  11  11  11  11  11  11  11  11  11  11  11  11  11  11
    12  12  12  12  12  12  12  12  12  12  12  12  12  12  12  12
    13  13  13  13  13  13  13  13  13  13  13  13  13  13  13  13
    14  14  14  14  14  14  14  14  14  14  14  14  14  14  14  14

    Output:
    
    Part 1/3
    +----+-----+-----+-----+-----+-----+-----+
    |    |   a |   b |   c |   d |   e |   f |
    |----+-----+-----+-----+-----+-----+-----|
    |  0 |   0 |   0 |   0 |   0 |   0 |   0 |
    |  1 |   1 |   1 |   1 |   1 |   1 |   1 |
    |  2 |   2 |   2 |   2 |   2 |   2 |   2 |
    |  3 |   3 |   3 |   3 |   3 |   3 |   3 |
    |  4 |   4 |   4 |   4 |   4 |   4 |   4 |
    +----+-----+-----+-----+-----+-----+-----+
    Part 2/3
    +----+-----+-----+-----+-----+-----+-----+
    |    |   g |   h |   i |   j |   k |   l |
    |----+-----+-----+-----+-----+-----+-----|
    |  0 |   0 |   0 |   0 |   0 |   0 |   0 |
    |  1 |   1 |   1 |   1 |   1 |   1 |   1 |
    |  2 |   2 |   2 |   2 |   2 |   2 |   2 |
    |  3 |   3 |   3 |   3 |   3 |   3 |   3 |
    |  4 |   4 |   4 |   4 |   4 |   4 |   4 |
    +----+-----+-----+-----+-----+-----+-----+
    Part 3/3
    +----+-----+-----+-----+
    |    |   m |   n |   p |
    |----+-----+-----+-----|
    |  0 |   0 |   0 |   0 |
    |  1 |   1 |   1 |   1 |
    |  2 |   2 |   2 |   2 |
    |  3 |   3 |   3 |   3 |
    |  4 |   4 |   4 |   4 |
    +----+-----+-----+-----+

    :param df: DataFrame
    :param doc: bool
    :return: None
    """
    def _simple_print(df):
        """
        Single line helper function to print a dataframe.
        :param df: DataFrame
        :return: None
        """
        print(tabulate(df, headers='keys', tablefmt='psql'))

    # Just print with no extra slicing.
    if not doc:
        _simple_print(df)

    # Slice the dataframe to show to Word.
    else:
        # Set the number of columns Word can handle. It typically can handle 6 columns.
        max_columns = 6
        # Find how many tables we will need to print if we cut the columns by 6ths.
        max_tables = int(np.ceil(df.shape[1] / max_columns))
        # For each 6th, print the table.
        for iteration in range(1, max_tables + 1):
            # Print which iteration we're on.
            print(f'Part {iteration}/{max_tables}')
            # Slice the larger dataframe to obtain the current dataframe.
            current_df = df.iloc[0:5, (iteration * max_columns) - max_columns:iteration * max_columns]
            # Print current dataframe.
            _simple_print(current_df)

def test(old_df, new_df, desired_columns, index=None, verbose=False, print_full=True):
    """
    Description
    -----------
    Check for differences (across time, business rule changes, etc.) over the same
    dataframe.

    Recommended to start without verbose, then add it if needed.

    You can toggle whether to print out the sample dataframes (head and tail),
    or not.


    Sample Usage
    ------------
    old_data = {'ID': ['CHA','COC','COF'],
            'Name': ['Chai Tea', 'Cocoa', 'Coffee'],
            'Description': ['Chai Kcup','Hot Chocolate Kcup','Coffee Kcup'],
            'Cost': [2,2,3]}

    new_data = {'ID': ['CHA','COC','COF'],
                'Name': ['Chai Tea', 'Cocoa', 'Coffee'],
                'Description': ['Chai','Hot Chocolate Kcup','Coffee Kcup'],
                'Cost': [2,3,3]}

    old_df = (pd.DataFrame(old_data)
              .set_index('ID'))
    new_df = (pd.DataFrame(new_data)
              .set_index('ID'))

  test(old_df=old_df,
                   new_df=new_df,
                   desired_columns=old_df.columns)


    Output In Console
    -----------------
    Match? False

    Quick Head Test
    Old Records:
    +------+----------+--------------------+--------+
    | ID   | Name     | Description        |   Cost |
    |------+----------+--------------------+--------|
    | CHA  | Chai Tea | Chai Kcup          |      2 |
    | COC  | Cocoa    | Hot Chocolate Kcup |      2 |
    | COF  | Coffee   | Coffee Kcup        |      3 |
    +------+----------+--------------------+--------+
    New Records:
    +------+----------+--------------------+--------+
    | ID   | Name     | Description        |   Cost |
    |------+----------+--------------------+--------|
    | CHA  | Chai Tea | Chai               |      2 |
    | COC  | Cocoa    | Hot Chocolate Kcup |      3 |
    | COF  | Coffee   | Coffee Kcup        |      3 |
    +------+----------+--------------------+--------+
    Quick Tail Test
    Old Records:
    +------+----------+--------------------+--------+
    | ID   | Name     | Description        |   Cost |
    |------+----------+--------------------+--------|
    | CHA  | Chai Tea | Chai Kcup          |      2 |
    | COC  | Cocoa    | Hot Chocolate Kcup |      2 |
    | COF  | Coffee   | Coffee Kcup        |      3 |
    +------+----------+--------------------+--------+
    New Records:
    +------+----------+--------------------+--------+
    | ID   | Name     | Description        |   Cost |
    |------+----------+--------------------+--------|
    | CHA  | Chai Tea | Chai               |      2 |
    | COC  | Cocoa    | Hot Chocolate Kcup |      3 |
    | COF  | Coffee   | Coffee Kcup        |      3 |
    +------+----------+--------------------+--------+


    Quick Cell to Cell Differences
    All differences: (2, 2)
                           Old   New
    ID
    CHA Description  Chai Kcup  Chai
    COC Cost                 2     3


    Parameters
    ----------
    :param old_df: DataFrame
        This could be cases you know to be correct, or cases from last week, etc.

    :param new_df: DataFrame
        This could be the current output created by code you wish to debug or double
        check, or cases from today, etc.

    :param desired_columns: list of strings
        <old_df.columns>

    :param index: string
        name of column to match cases against (currently not implemented)
        <Student ID>

    :return: None
    """
    new_df = new_df[desired_columns]
    old_df = old_df[desired_columns]

    # overall test (returns true or false)
    print(f'Match? {old_df.equals(new_df)}')
    print('\n')

    if print_full:
        # quick head and tail test
        print('Quick Head Test')
        print('Old Records:')
        print_table(old_df.head(5))

        print('New Records:')
        print_table(new_df.head(5))
        print('\n')

        print('Quick Tail Test')
        print('Old Records:')
        print_table(old_df.tail(5))

        print('New Records:')
        print_table(new_df.head(5))
        print('\n')

    def _diff_table(old_df, new_df):
        """
        Create the table that only shows differences (or "errors").


        This is code that manesioz on stackoverflow wrote as a response to my question
        for how to achieve this.

        :param old_df: pandas DataFrame
        :param new_df: pandas DataFrame
        :return: pandas DataFrame
        """
        # create frame of comparison bools
        bool_df = (old_df != new_df).stack()
        diff_table = pd.concat([old_df.stack()[bool_df],
                                new_df.stack()[bool_df]],
                               axis=1)
        diff_table.columns = ["Old", "New"]
        return diff_table

    diff_table = _diff_table(old_df,new_df)
    
    if verbose==False:
        # specific cell by cell test for differences
        print('Quick Cell to Cell Differences')
        print(f'All differences: {diff_table.shape}')
        print(diff_table.head())
        print('\n')

    # Offer a verbose option. Because the following code can produce errors if the
    # dataframes are too different (eg. if the column names differ), we offer the
    # non-verbose option.
    else:
        print('Cell to Cell Differences: All')
        print(f'All differences: {diff_table.shape}')
        print(diff_table)
        print('\n')

Problem description

Often I need to do a quick analysis of the same dataframe over some time. I need to look at the specific cells that changed, and how. Frequently, it's at a volume where it's impossible to just eyeball this. I'm surprised that under pandas' testing suite there's not currently an in-built method to do this.

I ended up using manesioz's code from my stackoverflow question here in my own utility function which is included here. But, I think a feature like this (specifically the output under Quick Cell to Cell Differences) could be really useful for other data analysts too.

More sample output of other times I've used this feature:

Ad hoc, timed request to find any differences in ordering exams over a time period of a month from schools

>>>last_week_df.columns
Index(['BCO', 'Superintendent', 'DBN', 'OrderedExams', 'FLOrderedExams',
       'Algebra I', 'ELA', 'Chemistry', 'Earth Science', 'Global Transition',
       'Global II', 'Algebra II', 'Living Environment', 'US History',
       'Physics', 'Geometry', 'All Exams'],
      dtype='object')
>>>last_week_df.shape
(616, 17)


Cell to Cell Differences: All
All differences: (66, 2)
                        Old  New
36  FLOrderedExams      Yes   No
40  FLOrderedExams      Yes   No
50  FLOrderedExams      Yes   No
66  FLOrderedExams      Yes   No
    ELA                  60   10
    All Exams           110   60
81  OrderedExams         No  Yes
    Algebra I             0   30
    Living Environment    0   15
    Geometry              0   25
    All Exams             0   70
90  FLOrderedExams      Yes   No
165 OrderedExams        Yes   No
    Algebra I            50    0
    ELA                 170    0
    Earth Science        25    0
    Global II            25    0
    Algebra II           25    0
    Living Environment   25    0
    Geometry             15    0
    All Exams           335    0
200 Physics              35   34
    All Exams          1010 1009
282 Living Environment  250  150
    All Exams           750  650
286 ELA                 615  440
    All Exams          1640 1465
432 Algebra I           695  620
    All Exams          1841 1766
540 OrderedExams         No  Yes
    Algebra I             0  130
    ELA                   0   50
    Global Transition     0   30
    Living Environment    0  140
    US History            0   40
    All Exams             0  390
552 OrderedExams         No  Yes
    Algebra I             0   15
    ELA                   0   25
    Chemistry             0    5
    Earth Science         0   15
    Global Transition     0   20
    Algebra II            0   20
    Living Environment    0   10
    US History            0   10
    Geometry              0   10
    All Exams             0  130
560 OrderedExams         No  Yes
    Algebra I             0  200
    Global Transition     0  100
    All Exams             0  300
561 OrderedExams         No  Yes
    Algebra I             0   12
    Living Environment    0   25
    All Exams             0   37
576 OrderedExams         No  Yes
    FLOrderedExams       No  Yes
    Algebra I             0   30
    ELA                   0  140
    All Exams             0  170
606 OrderedExams         No  Yes
    FLOrderedExams       No  Yes
    Earth Science         0   15
    Living Environment    0   15
    Geometry              0   25
    All Exams             0   55

Ad hoc, timed request to update a data file dependency for another file on locations of foreign language personnel at sites

MATHI
Match? False


Cell to Cell Differences: All
All differences: (5, 2)
                     Old  New
Exams go to?                 
14J558       MATHI_K   10    2
             MATHI_R    1    2
20J445       MATHI_K    1    3
28W440       MATHI_K    1    4
29W283       MATHI_K    1    4


USH
Match? False


Cell to Cell Differences: All
All differences: (3, 2)
                     Old  New
Exams go to?                 
02N600       USH_K    1    0
             USH_R    0    1
13J499       USH_K    2    1


PHYS
Match? False


Cell to Cell Differences: All
All differences: (8, 2)
                     Old  New
Exams go to?                 
02N475       PHYS_K    1    0
             PHYS_R    0    1
15J519       PHYS_K    4    2
             PHYS_R    1    2
28W690       PHYS_K    1    3
30W445       PHYS_K    1    2
31T450       PHYS_K    1    0
             PHYS_R    0    1


GLOBALI
Match? False


Cell to Cell Differences: All
All differences: (6, 2)
                     Old  New
Exams go to?                 
19J660       GLOBALI_K    3    1
             GLOBALI_R    1    2
21J540       GLOBALI_K    3    2
             GLOBALI_R    1    2
29W272       GLOBALI_K    1    3
             GLOBALI_R    2    1


LIVING
Match? True


Cell to Cell Differences: All
All differences: (0, 2)
Empty DataFrame
Columns: [Old, New]
Index: []


GLOBALII
Match? False


Cell to Cell Differences: All
All differences: (7, 2)
                     Old  New
Exams go to?                 
13J499       GLOBALII_K    6    2
             GLOBALII_R    1    2
19J660       GLOBALII_K    4    2
21J540       GLOBALII_K    2    3
28W620       GLOBALII_K    2    4
             GLOBALII_R    2    1
29W272       GLOBALII_K    2    4

The text was updated successfully, but these errors were encountered:

jreback · 2019-12-26T15:01:36Z

@christina-zhou-96 I think adding DataFrame.differences(self, other: pd.DataFrame, verbose: bool=False) would make sense (pretty much your _diff_table)

Would need a full doc-string w/o examples, as well as a section in the main docs.

cc @pandas-dev/pandas-core

topper-123 · 2019-12-26T16:38:07Z

I'm +1 on something like DataFrame.differences, though it should work for Series also.

Would unstacked output be more readable (i.e. keeping the index as-is and add the New/Old level to the columns)?:

>>> df = pd.DataFrame({"A": [2,2,1,1], "B": [3,3,2,2]}) 
>>> df2 = df * 2
>>> df2.iloc[0, 1] = df.iloc[0, 1]
>>> df.differences(df2)
     A         B
   New  Old  New  Old
0  4.0  2.0  NaN  NaN
1  4.0  2.0  6.0  3.0
2  2.0  1.0  4.0  2.0
3  2.0  1.0  4.0  2.0

This would also make some further ops on the data a bit more natural, e.g. diff_df["A"].pipe(lambda x: x.new - x.old).

fujiaxiang · 2019-12-29T13:44:41Z

I will take on this. Will make a draft full doc-string first. We can then discuss the design there.

fujiaxiang · 2019-12-29T13:44:54Z

take

jreback added Enhancement Datetime Datetime data dtype labels Dec 26, 2019

jreback added this to the Contributions Welcome milestone Dec 26, 2019

github-actions bot assigned fujiaxiang Dec 29, 2019

This was referenced Jan 9, 2020

Feature Request: context manager in which NaN equals NaN #30834

Closed

ENH: Added DataFrame.compare and Series.compare (GH30429) #30852

Merged

jreback modified the milestones: Contributions Welcome, 1.1 May 12, 2020

jreback closed this as completed in #30852 May 28, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Suggestion: add feature to show in detail changes in 1 df over time #30429

Suggestion: add feature to show in detail changes in 1 df over time #30429

christina-zhou-96 commented Dec 23, 2019

jreback commented Dec 26, 2019

topper-123 commented Dec 26, 2019

fujiaxiang commented Dec 29, 2019

fujiaxiang commented Dec 29, 2019

Suggestion: add feature to show in detail changes in 1 df over time #30429

Suggestion: add feature to show in detail changes in 1 df over time #30429

Comments

christina-zhou-96 commented Dec 23, 2019

Code Sample, a copy-pastable example if possible

Problem description

jreback commented Dec 26, 2019

topper-123 commented Dec 26, 2019

fujiaxiang commented Dec 29, 2019

fujiaxiang commented Dec 29, 2019