You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
importpandasaspdimportnumpyasnpfromtabulateimporttabulatedefprint_table(df, doc=False):
""" Print out a nice looking table. Set doc to True if you are printing this out for a Microsoft Word (11 x 8) document. When doc is False: Input: A B 0 1 0 1 2 0 2 3 0 Output: +----+-----+-----+ | | A | B | |----+-----+-----| | 0 | 1 | 0 | | 1 | 2 | 0 | | 2 | 3 | 0 | +----+-----+-----+ When doc is True: Input: a b c d e f g h i j k l m n p 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 Output: Part 1/3 +----+-----+-----+-----+-----+-----+-----+ | | a | b | c | d | e | f | |----+-----+-----+-----+-----+-----+-----| | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | +----+-----+-----+-----+-----+-----+-----+ Part 2/3 +----+-----+-----+-----+-----+-----+-----+ | | g | h | i | j | k | l | |----+-----+-----+-----+-----+-----+-----| | 0 | 0 | 0 | 0 | 0 | 0 | 0 | | 1 | 1 | 1 | 1 | 1 | 1 | 1 | | 2 | 2 | 2 | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | 4 | 4 | 4 | +----+-----+-----+-----+-----+-----+-----+ Part 3/3 +----+-----+-----+-----+ | | m | n | p | |----+-----+-----+-----| | 0 | 0 | 0 | 0 | | 1 | 1 | 1 | 1 | | 2 | 2 | 2 | 2 | | 3 | 3 | 3 | 3 | | 4 | 4 | 4 | 4 | +----+-----+-----+-----+ :param df: DataFrame :param doc: bool :return: None """def_simple_print(df):
""" Single line helper function to print a dataframe. :param df: DataFrame :return: None """print(tabulate(df, headers='keys', tablefmt='psql'))
# Just print with no extra slicing.ifnotdoc:
_simple_print(df)
# Slice the dataframe to show to Word.else:
# Set the number of columns Word can handle. It typically can handle 6 columns.max_columns=6# Find how many tables we will need to print if we cut the columns by 6ths.max_tables=int(np.ceil(df.shape[1] /max_columns))
# For each 6th, print the table.foriterationinrange(1, max_tables+1):
# Print which iteration we're on.print(f'Part {iteration}/{max_tables}')
# Slice the larger dataframe to obtain the current dataframe.current_df=df.iloc[0:5, (iteration*max_columns) -max_columns:iteration*max_columns]
# Print current dataframe._simple_print(current_df)
deftest(old_df, new_df, desired_columns, index=None, verbose=False, print_full=True):
""" Description ----------- Check for differences (across time, business rule changes, etc.) over the same dataframe. Recommended to start without verbose, then add it if needed. You can toggle whether to print out the sample dataframes (head and tail), or not. Sample Usage ------------ old_data = {'ID': ['CHA','COC','COF'], 'Name': ['Chai Tea', 'Cocoa', 'Coffee'], 'Description': ['Chai Kcup','Hot Chocolate Kcup','Coffee Kcup'], 'Cost': [2,2,3]} new_data = {'ID': ['CHA','COC','COF'], 'Name': ['Chai Tea', 'Cocoa', 'Coffee'], 'Description': ['Chai','Hot Chocolate Kcup','Coffee Kcup'], 'Cost': [2,3,3]} old_df = (pd.DataFrame(old_data) .set_index('ID')) new_df = (pd.DataFrame(new_data) .set_index('ID')) test(old_df=old_df, new_df=new_df, desired_columns=old_df.columns) Output In Console ----------------- Match? False Quick Head Test Old Records: +------+----------+--------------------+--------+ | ID | Name | Description | Cost | |------+----------+--------------------+--------| | CHA | Chai Tea | Chai Kcup | 2 | | COC | Cocoa | Hot Chocolate Kcup | 2 | | COF | Coffee | Coffee Kcup | 3 | +------+----------+--------------------+--------+ New Records: +------+----------+--------------------+--------+ | ID | Name | Description | Cost | |------+----------+--------------------+--------| | CHA | Chai Tea | Chai | 2 | | COC | Cocoa | Hot Chocolate Kcup | 3 | | COF | Coffee | Coffee Kcup | 3 | +------+----------+--------------------+--------+ Quick Tail Test Old Records: +------+----------+--------------------+--------+ | ID | Name | Description | Cost | |------+----------+--------------------+--------| | CHA | Chai Tea | Chai Kcup | 2 | | COC | Cocoa | Hot Chocolate Kcup | 2 | | COF | Coffee | Coffee Kcup | 3 | +------+----------+--------------------+--------+ New Records: +------+----------+--------------------+--------+ | ID | Name | Description | Cost | |------+----------+--------------------+--------| | CHA | Chai Tea | Chai | 2 | | COC | Cocoa | Hot Chocolate Kcup | 3 | | COF | Coffee | Coffee Kcup | 3 | +------+----------+--------------------+--------+ Quick Cell to Cell Differences All differences: (2, 2) Old New ID CHA Description Chai Kcup Chai COC Cost 2 3 Parameters ---------- :param old_df: DataFrame This could be cases you know to be correct, or cases from last week, etc. :param new_df: DataFrame This could be the current output created by code you wish to debug or double check, or cases from today, etc. :param desired_columns: list of strings <old_df.columns> :param index: string name of column to match cases against (currently not implemented) <Student ID> :return: None """new_df=new_df[desired_columns]
old_df=old_df[desired_columns]
# overall test (returns true or false)print(f'Match? {old_df.equals(new_df)}')
print('\n')
ifprint_full:
# quick head and tail testprint('Quick Head Test')
print('Old Records:')
print_table(old_df.head(5))
print('New Records:')
print_table(new_df.head(5))
print('\n')
print('Quick Tail Test')
print('Old Records:')
print_table(old_df.tail(5))
print('New Records:')
print_table(new_df.head(5))
print('\n')
def_diff_table(old_df, new_df):
""" Create the table that only shows differences (or "errors"). This is code that manesioz on stackoverflow wrote as a response to my question for how to achieve this. :param old_df: pandas DataFrame :param new_df: pandas DataFrame :return: pandas DataFrame """# create frame of comparison boolsbool_df= (old_df!=new_df).stack()
diff_table=pd.concat([old_df.stack()[bool_df],
new_df.stack()[bool_df]],
axis=1)
diff_table.columns= ["Old", "New"]
returndiff_tablediff_table=_diff_table(old_df,new_df)
ifverbose==False:
# specific cell by cell test for differencesprint('Quick Cell to Cell Differences')
print(f'All differences: {diff_table.shape}')
print(diff_table.head())
print('\n')
# Offer a verbose option. Because the following code can produce errors if the# dataframes are too different (eg. if the column names differ), we offer the# non-verbose option.else:
print('Cell to Cell Differences: All')
print(f'All differences: {diff_table.shape}')
print(diff_table)
print('\n')
Problem description
Often I need to do a quick analysis of the same dataframe over some time. I need to look at the specific cells that changed, and how. Frequently, it's at a volume where it's impossible to just eyeball this. I'm surprised that under pandas' testing suite there's not currently an in-built method to do this.
I ended up using manesioz's code from my stackoverflow question here in my own utility function which is included here. But, I think a feature like this (specifically the output under Quick Cell to Cell Differences) could be really useful for other data analysts too.
More sample output of other times I've used this feature:
Ad hoc, timed request to find any differences in ordering exams over a time period of a month from schools
>>>last_week_df.columns
Index(['BCO', 'Superintendent', 'DBN', 'OrderedExams', 'FLOrderedExams',
'Algebra I', 'ELA', 'Chemistry', 'Earth Science', 'Global Transition',
'Global II', 'Algebra II', 'Living Environment', 'US History',
'Physics', 'Geometry', 'All Exams'],
dtype='object')
>>>last_week_df.shape
(616, 17)
Cell to Cell Differences: All
All differences: (66, 2)
Old New
36 FLOrderedExams Yes No
40 FLOrderedExams Yes No
50 FLOrderedExams Yes No
66 FLOrderedExams Yes No
ELA 60 10
All Exams 110 60
81 OrderedExams No Yes
Algebra I 0 30
Living Environment 0 15
Geometry 0 25
All Exams 0 70
90 FLOrderedExams Yes No
165 OrderedExams Yes No
Algebra I 50 0
ELA 170 0
Earth Science 25 0
Global II 25 0
Algebra II 25 0
Living Environment 25 0
Geometry 15 0
All Exams 335 0
200 Physics 35 34
All Exams 1010 1009
282 Living Environment 250 150
All Exams 750 650
286 ELA 615 440
All Exams 1640 1465
432 Algebra I 695 620
All Exams 1841 1766
540 OrderedExams No Yes
Algebra I 0 130
ELA 0 50
Global Transition 0 30
Living Environment 0 140
US History 0 40
All Exams 0 390
552 OrderedExams No Yes
Algebra I 0 15
ELA 0 25
Chemistry 0 5
Earth Science 0 15
Global Transition 0 20
Algebra II 0 20
Living Environment 0 10
US History 0 10
Geometry 0 10
All Exams 0 130
560 OrderedExams No Yes
Algebra I 0 200
Global Transition 0 100
All Exams 0 300
561 OrderedExams No Yes
Algebra I 0 12
Living Environment 0 25
All Exams 0 37
576 OrderedExams No Yes
FLOrderedExams No Yes
Algebra I 0 30
ELA 0 140
All Exams 0 170
606 OrderedExams No Yes
FLOrderedExams No Yes
Earth Science 0 15
Living Environment 0 15
Geometry 0 25
All Exams 0 55
Ad hoc, timed request to update a data file dependency for another file on locations of foreign language personnel at sites
MATHI
Match? False
Cell to Cell Differences: All
All differences: (5, 2)
Old New
Exams go to?
14J558 MATHI_K 10 2
MATHI_R 1 2
20J445 MATHI_K 1 3
28W440 MATHI_K 1 4
29W283 MATHI_K 1 4
USH
Match? False
Cell to Cell Differences: All
All differences: (3, 2)
Old New
Exams go to?
02N600 USH_K 1 0
USH_R 0 1
13J499 USH_K 2 1
PHYS
Match? False
Cell to Cell Differences: All
All differences: (8, 2)
Old New
Exams go to?
02N475 PHYS_K 1 0
PHYS_R 0 1
15J519 PHYS_K 4 2
PHYS_R 1 2
28W690 PHYS_K 1 3
30W445 PHYS_K 1 2
31T450 PHYS_K 1 0
PHYS_R 0 1
GLOBALI
Match? False
Cell to Cell Differences: All
All differences: (6, 2)
Old New
Exams go to?
19J660 GLOBALI_K 3 1
GLOBALI_R 1 2
21J540 GLOBALI_K 3 2
GLOBALI_R 1 2
29W272 GLOBALI_K 1 3
GLOBALI_R 2 1
LIVING
Match? True
Cell to Cell Differences: All
All differences: (0, 2)
Empty DataFrame
Columns: [Old, New]
Index: []
GLOBALII
Match? False
Cell to Cell Differences: All
All differences: (7, 2)
Old New
Exams go to?
13J499 GLOBALII_K 6 2
GLOBALII_R 1 2
19J660 GLOBALII_K 4 2
21J540 GLOBALII_K 2 3
28W620 GLOBALII_K 2 4
GLOBALII_R 2 1
29W272 GLOBALII_K 2 4
The text was updated successfully, but these errors were encountered:
@christina-zhou-96 I think adding DataFrame.differences(self, other: pd.DataFrame, verbose: bool=False) would make sense (pretty much your _diff_table)
Would need a full doc-string w/o examples, as well as a section in the main docs.
Code Sample, a copy-pastable example if possible
Problem description
Often I need to do a quick analysis of the same dataframe over some time. I need to look at the specific cells that changed, and how. Frequently, it's at a volume where it's impossible to just eyeball this. I'm surprised that under pandas' testing suite there's not currently an in-built method to do this.
I ended up using manesioz's code from my stackoverflow question here in my own utility function which is included here. But, I think a feature like this (specifically the output under Quick Cell to Cell Differences) could be really useful for other data analysts too.
More sample output of other times I've used this feature:
The text was updated successfully, but these errors were encountered: