# Energy Fraud Detection

Imports for notebook:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
%matplotlib inline

#from itables import init_notebook_mode
#init_notebook_mode(all_interactive=True)

## Introduction

This notebook looks at a dataset of clients and their energy usage over time. The target variable is fraud, labelled on a per-client basis.

This is an interesting problem because the data is partially corrupted and has major inconsistencies. Furthermore, there is insufficient information to obtain the target variable for all cases.

This notebook conducts the initial data cleaning, where the raw datasets are input, and the 'cleaned' datasets are output.

When rows are removed, they are placed in the df_removed_{train|test} dataset. At least one row for every 'client_id' is needed.

Where relevant, the last row will be kept for the 'client_id' if all would otherwise be removed.
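
The "keep at least one row per client" rule can be sketched on toy data as follows. This is only an illustration under assumptions: the `bad` column and the frame names are hypothetical stand-ins for whatever cleaning rule is being applied, and "last" means last in the current row order.

```python
import pandas as pd

# Toy data: 'bad' marks rows that a cleaning rule would like to drop (hypothetical column).
df = pd.DataFrame({
    'client_id': ['a', 'a', 'b', 'b'],
    'bad':       [True, False, True, True],
})

remove = df['bad'].copy()
# Clients for which every row is marked for removal
all_bad = remove.groupby(df['client_id']).transform('all')
# Last row (in the current order) of each client
is_last = ~df.duplicated('client_id', keep='last')
# Spare the last row of any client that would otherwise vanish entirely
remove = remove & ~(all_bad & is_last)

df_removed = df[remove]   # plays the role of df_removed_{train|test}
df_kept = df[~remove]
```

Client `b` has every row marked, so its final row is retained and every `client_id` still appears in `df_kept`.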

Please refer to the main notebook, "???", for further details and continuation.

## Data Cleaning

### Data Import

In [2]:
# Read the CSV files
df_client_test = pd.read_csv('./client_test.csv', on_bad_lines='skip')
df_client_train = pd.read_csv('./client_train.csv', on_bad_lines='skip')
df_invoice_test = pd.read_csv('./invoice_test.csv', on_bad_lines='skip')
# low_memory=False is suggested by pandas here, due to unexpected (mixed-type) values and the large data size
df_invoice_train = pd.read_csv('./invoice_train.csv', on_bad_lines='skip', low_memory=False)
df_SampleSubmission = pd.read_csv('./SampleSubmission (2).csv', on_bad_lines='skip')

In [3]:
df_SampleSubmission.head()

Unnamed: 0,client_id,target
0,test_Client_0,0.957281
1,test_Client_1,0.996425
2,test_Client_10,0.612359
3,test_Client_100,0.776933
4,test_Client_1000,0.571046


In [4]:
df_client_test.head()

Unnamed: 0,disrict,client_id,client_catg,region,creation_date
0,62,test_Client_0,11,307,28/05/2002
1,69,test_Client_1,11,103,06/08/2009
2,62,test_Client_10,11,310,07/04/2004
3,60,test_Client_100,11,101,08/10/1992
4,62,test_Client_1000,11,301,21/07/1977


In [5]:
df_invoice_test.head()

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
0,test_Client_0,2018-03-16,11,651208,0,203,8,1,755,0,0,0,19145,19900,8,ELEC
1,test_Client_0,2014-03-21,11,651208,0,203,8,1,1067,0,0,0,13725,14792,8,ELEC
2,test_Client_0,2014-07-17,11,651208,0,203,8,1,0,0,0,0,14792,14792,4,ELEC
3,test_Client_0,2015-07-13,11,651208,0,203,9,1,410,0,0,0,16122,16532,4,ELEC
4,test_Client_0,2016-07-19,11,651208,0,203,9,1,412,0,0,0,17471,17883,4,ELEC


In [6]:
print(f"Number of rows in client train vs invoice train: {len(df_client_train)} vs {len(df_invoice_train)}")
print(f"Number of unique client_id in client train vs invoice train: {df_client_train['client_id'].nunique()} vs {df_invoice_train['client_id'].nunique()}")
print(f"Number of rows in client test vs invoice test: {len(df_client_test)} vs {len(df_invoice_test)}")
print(f"Number of unique client_id in client test vs invoice test: {df_client_test['client_id'].nunique()} vs {df_invoice_test['client_id'].nunique()}", end="")

Number of rows in client train vs invoice train: 135493 vs 4476749
Number of unique client_id in client train vs invoice train: 135493 vs 135493
Number of rows in client test vs invoice test: 58069 vs 1939730
Number of unique client_id in client test vs invoice test: 58069 vs 58069
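
Equal counts of unique 'client_id' do not by themselves show that both tables hold the same clients. A minimal sketch of a stricter check, on toy frames (the names `client` and `invoice` are placeholders, not the notebook's variables):

```python
import pandas as pd

client = pd.DataFrame({'client_id': ['c1', 'c2', 'c3']})
invoice = pd.DataFrame({'client_id': ['c1', 'c1', 'c2', 'c3', 'c3']})

ids_client = set(client['client_id'])
ids_invoice = set(invoice['client_id'])

# Clients without any invoice, and invoices without a client record
missing_invoices = ids_client - ids_invoice
missing_clients = ids_invoice - ids_client
same_ids = not missing_invoices and not missing_clients
```

If `same_ids` is true, a many-to-one join from invoices to clients should leave no unmatched rows on either side.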

Merging the client tables into the invoice tables, for both train and test:

In [7]:
df_test = df_invoice_test.join(df_client_test.set_index('client_id'), on='client_id', validate='m:1').copy()
df_train = df_invoice_train.join(df_client_train.set_index('client_id'), on='client_id', validate='m:1').copy()
df_train

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,...,consommation_level_4,old_index,new_index,months_number,counter_type,disrict,client_catg,region,creation_date,target
0,train_Client_0,2014-03-24,11,1335667,0,203,8,1,82,0,...,0,14302,14384,4,ELEC,60,11,101,31/12/1994,0.0
1,train_Client_0,2013-03-29,11,1335667,0,203,6,1,1200,184,...,0,12294,13678,4,ELEC,60,11,101,31/12/1994,0.0
2,train_Client_0,2015-03-23,11,1335667,0,203,8,1,123,0,...,0,14624,14747,4,ELEC,60,11,101,31/12/1994,0.0
3,train_Client_0,2015-07-13,11,1335667,0,207,8,1,102,0,...,0,14747,14849,4,ELEC,60,11,101,31/12/1994,0.0
4,train_Client_0,2016-11-17,11,1335667,0,207,9,1,572,0,...,0,15066,15638,12,ELEC,60,11,101,31/12/1994,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4476744,train_Client_99998,2005-08-19,10,1253571,0,202,9,1,400,135,...,0,3197,3732,8,ELEC,60,11,101,22/12/1993,0.0
4476745,train_Client_99998,2005-12-19,10,1253571,0,202,6,1,200,6,...,0,3732,3938,4,ELEC,60,11,101,22/12/1993,0.0
4476746,train_Client_99999,1996-09-25,11,560948,0,203,6,1,259,0,...,0,13884,14143,4,ELEC,60,11,101,18/02/1986,0.0
4476747,train_Client_99999,1996-05-28,11,560948,0,203,6,1,603,0,...,0,13281,13884,4,ELEC,60,11,101,18/02/1986,0.0


In [8]:
del df_invoice_train, df_invoice_test, df_client_train, df_client_test, df_SampleSubmission

### Renaming Columns and setting Data Types

There are some comments from the source regarding the meaning of these columns. These are included here verbatim.
* "client_train.csv":
  * "disrict: District where the client is"
  * "client_id: Unique id for client"
  * "client_catg: Category client belongs to"
  * "region: Area where the client is"
  * "creation_date: Date client joined"
  * "target: fraud:1, not fraud: 0"
* "invoice_train.csv":
  * "client_id: Unique id for client"
  * "invoice_date: Date of the invoice"
  * "tarif_type: Type of tax"
  * "counter_number: number"
  * "counter_statue: akes up to 5 values such as working fine, not working, on hold statue, ect"
  * "counter_code: code"
  * "reading_remarque: notes that the STEG agent takes during his visit to the cleint (e.g.: if the counter shows something wrong, the"
  * "counter_coefficient: An additional coefficient to be added when standard consumption is exceeded"
  * "consommation_level_1: Consumption_level_1"
  * "consommation_level_2: Consumption_level_2"
  * "consommation_level_3: Consumption_level_3"
  * "consommation_level_4: Consumption_level_4"
  * "old_index: Old index"
  * "new_index: New index"
  * "months_number: Month number"
  * "counter_type: Type of counter"

Going to rename the columns slightly to better match my interpretation of the data gained from inspection.

In [9]:
rename_dict = {'client_id' : 'client_id', 
               'invoice_date' : 'invoice_date', 
               'tarif_type' : 'mtr_tariff', 
               'counter_number' : 'mtr_id',
               'counter_statue' : 'mtr_status', 
               'counter_code' : 'mtr_code', 
               'reading_remarque' : 'mtr_notes',
               'counter_coefficient' : 'mtr_coef', 
               'consommation_level_1' : 'usage_1', 
               'consommation_level_2' : 'usage_2',
               'consommation_level_3' : 'usage_3', 
               'consommation_level_4' : 'usage_4', 
               'old_index' : 'mtr_val_old',
               'new_index' : 'mtr_val_new', 
               'months_number': 'months_num', 
               'counter_type' : 'mtr_type', 
               'disrict' : 'district', 
               'client_catg' : 'client_type',
               'region' : 'region', 
               'creation_date' : 'start_date', 
               'target' : 'fraud'}

df_test.rename(columns=rename_dict, inplace=True) # Note that test won't have target
df_train.rename(columns=rename_dict, inplace=True)

Converting data types where appropriate:

In [10]:
# invoice_date: object -> date [YYYY-MM-DD] -> [YYYY-MM-DD]
df_train['invoice_date'] = pd.to_datetime(df_train['invoice_date'])
# start_date: object -> date [DD/MM/YYYY] -> [YYYY-MM-DD]
df_train['start_date'] = pd.to_datetime(df_train['start_date'], dayfirst=True)

col_names = ['mtr_type', 'district', 'client_type', 'region']
df_train[col_names] = df_train[col_names].astype("category")

# Converting 'mtr_val_' into float for now, as decimal places get in the way otherwise
df_train['mtr_val_old'] = df_train['mtr_val_old'].astype(float).round(0)
df_train['mtr_val_new'] = df_train['mtr_val_new'].astype(float).round(0)
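
Since day/month ambiguity turns out to matter later in this notebook, one option is to state the expected format explicitly rather than rely on `dayfirst`. A small sketch using the 'creation_date' strings shown in the client table above; using `format=` here is a suggestion, not what the notebook does:

```python
import pandas as pd

raw = pd.Series(['28/05/2002', '06/08/2009', '21/07/1977'])

# An explicit format leaves no room for day/month guessing
parsed = pd.to_datetime(raw, format='%d/%m/%Y')
```

Here '06/08/2009' is read as 6 August 2009, not 8 June 2009.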

### Data Structure and Hierarchy
Please refer to main notebook for the context.

In [11]:
# This will change depending on context, but is the generic grouping to order rows.
grp_cols = ['client_id', 'mtr_type', 'mtr_tariff', 'mtr_id'] # Then to be sorted by: 'invoice_date'
# This will need to be updated as aspects such as dates change.
df_train.sort_values(by=grp_cols+['invoice_date', 'mtr_code'], inplace=True)

Note that, very rarely, two 'mtr_code's can occupy an 'invoice_date' in this grouping. Since this is costly to rectify, and it is so rare, it will only be checked for twice: first after the 'invoice_date' column is corrected from corruption, and again near the end to ensure there are no new clashes.

Based on data mining, some specific relationships have been deduced:
* usage_1 -> usage_4 are sequential "buckets" of usage, that if capped, can "spill" over into the next level.
* It is unclear how the presence of a cap is determined:
  * If mtr_tariff == [10|11] & mtr_code != [3__], more than usage_1 may be used.
  * If usage_3 is being used, so will usage_4.
  * usage_3 cap is equal to usage_1 cap. usage_2 cap is equal to half the usage_1 cap. usage_4 is uncapped.
  * Order of use is sequential; i.e., cannot use usage_2 before usage_1, etc.
  * If monthly cap == [50 | 300], usage_3 & usage_4 are not used.
* It is unclear how the quantity of the cap is determined:
  * The mtr_tariff and mtr_code seem to have an influence. mtr_status also seems to be influential.
  * If mtr_tariff == [10] & mtr_code == [1__], monthly cap is typically 50 | 100.
  * If mtr_tariff == [10] & mtr_code == [2__], monthly cap is typically 50 | 200.
  * If mtr_tariff == [11], monthly cap is typically 200 | 300.
* The "monthly cap" would be multiplied by the "months_num" to give the cap. This applies most of the time but cannot be relied upon.
* It is being assumed that there is an allowance of energy at different rates, provided on a monthly basis, thus requiring the scaling.
* mtr_val_old and mtr_val_new are meant to track the meter reading:
  * mtr_val_old will be prior to adding the invoice's usage, and mtr_val_new will be after.
  * The usage ("usage_n") is the sum of usage_1 -> usage_4 ("usage_N"), all divided by 'mtr_coef'.
  * This applies most of the time.
* On 'mtr_coef':
  * If mtr_coef > 1, and [usage_3|usage_4] > 0, monthly cap seems to be 200.
  * If mtr_coef > 1, and [usage_3|usage_4] > 0, 'mtr_code' == [483|5__].
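
The bucket rules deduced above can be sketched as a small function. This is an illustration only: `split_usage` is a hypothetical helper, the inputs are illustrative values, and it ignores the exception noted above where usage_3 and usage_4 are not used for monthly caps of 50 or 300.

```python
def split_usage(total, monthly_cap, months_num):
    """Spill `total` through usage_1..usage_4 under the deduced cap rules."""
    cap_1 = monthly_cap * months_num
    # Caps for usage_1, usage_2 (half of usage_1) and usage_3 (same as usage_1)
    caps = [cap_1, cap_1 / 2, cap_1]
    buckets = []
    remaining = total
    for cap in caps:
        used = min(remaining, cap)
        buckets.append(used)
        remaining -= used
    buckets.append(remaining)  # usage_4 is uncapped and takes whatever is left
    return buckets


# Illustrative inputs: a monthly cap of 200 over 1 month, and of 300 over 4 months
spill_all = split_usage(2310, 200, 1)
spill_two = split_usage(1384, 300, 4)
```

With these inputs `spill_all` fills the first three buckets and leaves the rest in usage_4, while `spill_two` only reaches usage_2.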

Conclusions based on the above:
* A current invoice's mtr_val_old should match previous invoice's mtr_val_new if there was no gap in between. It can be more if there is a gap. It should not be less.
* Similarly, the months_num should equal the difference in invoice dates in months (if there were no gaps). It can be less if there is a gap. It should not be more.
* The mtr_val_old plus the usage should equal mtr_val_new.
* The delta of the mtr_val_new of the current invoice and previous invoice should be energy used during the delta of the two invoice dates.

### Data Quality Markers

Going to use some helper functions that identify data inconsistencies as compared to expectations.
#### Intra-Row: Energy Usage
* 'usage_N' is sum of the 'usage_{1|2|3|4}' columns.
* 'usage_N' / 'mtr_coef' = 'usage_n' which represents, presumably, the amount being paid for.
* 'mtr_val_old' + 'usage_n' = 'mtr_val_new'.

'Calc_Usage()' calculates these metrics and derives them via different possible permutations.

In [12]:
def Calc_Usage(df_train):
    rows = len(df_train)
    df_train['usage_N'] = df_train.loc[:, ['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1)
    df_train['usage_n'] = (df_train['usage_N'] / df_train.loc[:, 'mtr_coef']).round(0)
    df_train['monthly_usage'] = df_train['usage_n'].div(df_train['months_num'])
    # Calculating permutations of this:
    df_train['usage_n_calc'] = df_train['mtr_val_new'] - df_train['mtr_val_old']
    df_train['mtr_coef_calc'] = (df_train['usage_N'] / df_train['usage_n_calc']).round(1)
    df_train['mtr_val_new_calc'] = df_train['mtr_val_old'] + df_train['usage_n'].round(0)
    df_train['mtr_val_old_calc'] = df_train['mtr_val_new'] - df_train['usage_n'].round(0)
    mask = df_train['usage_n_calc'] != df_train['usage_n']
    print(f"Rows with unexpected usage values / Total Rows: {sum(mask)} / {rows}.")
    df_train['usage_flag'] = mask.astype(int)
    return df_train.copy()
df_train = Calc_Usage(df_train.copy())

Rows with unexpected usage values / Total Rows: 17533 / 4476749.
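
To make the intra-row check concrete, here is the same comparison on two toy rows: one consistent, and one where the summed usage disagrees with the meter readings. The values are taken from rows shown elsewhere in this notebook; the frame itself is only a sketch.

```python
import pandas as pd

df = pd.DataFrame({
    'usage_1': [124, 5], 'usage_2': [0, 229], 'usage_3': [0, 0], 'usage_4': [0, 0],
    'mtr_coef': [1, 1],
    'mtr_val_old': [3685.0, 0.0],
    'mtr_val_new': [3809.0, 342.0],
})

usage_N = df[['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1)
usage_n = (usage_N / df['mtr_coef']).round(0)        # usage implied by the buckets
usage_n_calc = df['mtr_val_new'] - df['mtr_val_old']  # usage implied by the meter
df['usage_flag'] = (usage_n != usage_n_calc).astype(int)
```

Note that a missing value on either side also sets the flag, because NaN never compares equal.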


#### Inter-Row: Meter and Date Consistency
* 'mtr_val_{old|new}' should monotonically increase over time. 
* 'invoice_date' - 'months_num' >= Prev('invoice_date') and should strictly increase over time.

'Calc_Neighbours()' calculates these relationships. Since 'months_num' is an integer and vague in its definition, a 1-month leeway is added to the calculations.

In [13]:
def Calc_Neighbours(df_train, grp_cols=grp_cols):
    rows = len(df_train)
    # Group, Find neighbours
    df_grp = df_train.groupby(grp_cols, observed=True)
    df_train['mtr_val_new_prv_2'] = df_grp['mtr_val_new'].shift(2)
    col_names = ['mtr_val_old', 'mtr_val_new', 'invoice_date', 'months_num', 'monthly_usage']
    df_train[[col_name+'_prv' for col_name in col_names]] = (df_grp[col_names].shift(1))
    df_train[[col_name+'_nxt' for col_name in col_names[:2]]] = (df_grp[col_names[:2]].shift(-1))
    df_train['mtr_val_old_nxt_2'] = df_grp['mtr_val_old'].shift(-2)
    # Calculate expected metrics based on relationships
    df_train['invoice_date_prv_calc'] = pd.NaT # If months_num is unreasonable it would crash
    mask = (df_train['months_num'] > 0) & (df_train['months_num'] < 600) & (df_train['months_num'].notna()) # 0 - 50 Years
    df_train['invoice_date_prv_calc'] = df_train.loc[mask, 'invoice_date'] - pd.to_timedelta(df_train.loc[mask, 'months_num'].astype(int) * 30.5, unit='days')
    df_train['months_num_calc'] = np.ceil(((df_train['invoice_date'] - df_train['invoice_date_prv']).dt.days.values / 30.5))
    df_train['months_gap_calc'] = df_train['months_num_calc'] - df_train['months_num'] # Difference in dates in months
    # Check Rules
    mask = ((df_train['mtr_val_old'] < df_train['mtr_val_new_prv']) & (df_train['mtr_val_new_prv'].notna()))
    df_train['mtr_flag_bkd'] = mask.astype(int)
    print(f"Rows where meter readings seem out of order looking backwards / Total Rows: {sum(mask)} / {rows}. Excluded NAs: {sum(df_train['mtr_val_new_prv'].isna())}.")
    mask = ((df_train['mtr_val_new'] > df_train['mtr_val_old_nxt']) & (df_train['mtr_val_old_nxt'].notna()))
    df_train['mtr_flag_fwd'] = mask.astype(int)
    print(f"Rows where meter readings seem out of order looking forwards / Total Rows: {sum(mask)} / {rows}. Excluded NAs: {sum(df_train['mtr_val_old_nxt'].isna())}.")
    mask = ((df_train['invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') < df_train['invoice_date_prv']) & (df_train['invoice_date_prv_calc'].notna()))
    df_train['date_flag'] = mask.astype(int)
    print(f"Rows where invoice spans seem to overlap based on \'months_num\' / Total Rows: {sum(mask)} / {rows}. Excluded NAs: {sum(df_train['months_num'].isna())}.")
    mask = ((df_train['months_num_calc'] < df_train['months_num']) & (df_train['invoice_date_prv_calc'].notna()))
    df_train.loc[mask, 'date_flag'] = 1
    print(f"Rows where invoice spans seem to overlap based on \'invoice_date\'s / Total Rows: {sum(mask)} / {rows}. Excluded NAs: {sum(df_train['months_num'].isna())}.")
    return df_train.copy()

In [14]:
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows where meter readings seem out of order looking backwards / Total Rows: 597824 / 4476749. Excluded NAs: 233780.
Rows where meter readings seem out of order looking forwards / Total Rows: 597824 / 4476749. Excluded NAs: 233780.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 1014829 / 4476749. Excluded NAs: 0.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 1014871 / 4476749. Excluded NAs: 0.
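
The core of the inter-row check is a grouped `shift`. A reduced sketch on toy data, grouping by 'mtr_id' only (the notebook groups by more columns) and reproducing just the backward meter flag and the month gap:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mtr_id': [1, 1, 1, 2, 2],
    'invoice_date': pd.to_datetime(
        ['2005-10-17', '2006-02-24', '2006-06-23', '2010-01-15', '2010-05-14']),
    'months_num': [4, 4, 4, 4, 4],
    'mtr_val_old': [3685, 3809, 3950, 500, 450],
    'mtr_val_new': [3809, 3950, 4112, 560, 700],
})

grp = df.groupby('mtr_id')
prev_new = grp['mtr_val_new'].shift(1)     # previous invoice's closing reading
prev_date = grp['invoice_date'].shift(1)   # previous invoice's date

# Flag rows whose opening reading is below the previous closing reading
df['mtr_flag_bkd'] = ((df['mtr_val_old'] < prev_new) & prev_new.notna()).astype(int)
# Months between invoices, rounded up, using 30.5 days per month as in the notebook
df['months_num_calc'] = np.ceil((df['invoice_date'] - prev_date).dt.days / 30.5)
df['months_gap_calc'] = df['months_num_calc'] - df['months_num']
```

The first row of each meter has no predecessor, so its flag stays 0 and its month calculations are NaN.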


### Data Corruption

#### Column Misalignment
##### Focusing on 'mtr_status'
'mtr_status' had some suspiciously rare values that were manually inspected: [3, 2, 46, A, 618, 769, 269375, 420].

Of these, the values ['46', '618', '769', '269375', '420'] seemed symptomatic of data quality issues.

In [15]:
mask = df_train['mtr_status'].isin(['46', '618', '769', '269375', '420']) # Manual check showed 'A' seems acceptable
print(f'number of bad mtr_status: {sum(mask)}.')
col_names = ['mtr_tariff', 'mtr_id', 'mtr_status', 'mtr_code', 'mtr_notes', 'mtr_coef', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'mtr_val_old', 'mtr_val_new', 'months_num']
pd.concat([df_train[~mask].head(5), df_train[mask].head(5)])[col_names] # Comparison

number of bad mtr_status: 34.


Unnamed: 0,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num
22,11,1335667,0,203,6,1,124,0,0,0,3685.0,3809.0,4
23,11,1335667,0,203,6,1,141,0,0,0,3809.0,3950.0,4
24,11,1335667,0,203,6,1,162,0,0,0,3950.0,4112.0,4
25,11,1335667,0,203,6,1,159,0,0,0,4112.0,4271.0,4
28,11,1335667,0,203,6,1,182,0,0,0,4271.0,4453.0,4
1178214,11,170,769,0,207,6,1,332,0,0,0.0,0.0,332
1178211,11,170,769,0,207,6,1,385,0,0,0.0,332.0,717
1178200,11,170,769,0,207,6,1,479,0,0,0.0,717.0,1196
1178209,11,170,769,0,207,6,1,437,0,0,0.0,1196.0,1633
1178207,11,170,769,0,207,6,1,453,0,0,0.0,1633.0,2086


For these, it is assumed that the columns have been shifted by mistake. One key sign here is usage_1 == 1 and mtr_coef == 6 | 8 | 9.

It is not clear why. In particular, it is not clear what 'mtr_id' and/or 'mtr_status' is meant to represent here and which might be wrong. 'mtr_id' will be kept and 'mtr_status' will be deleted.

'months_num' will be empty for now.

In [16]:
# Flag and modify
df_train['col_shift_flag'] = mask.astype(int)
df_train.loc[mask, df_train.columns[4:14]] = df_train.loc[mask, df_train.columns[5:15]].values
df_train.loc[mask, df_train.columns[14]] = np.nan
df_train = Calc_Usage(df_train.copy())
pd.concat([df_train[~mask].head(5), df_train[mask].head(5)])[col_names] # Comparison

Rows with unexpected usage values / Total Rows: 17502 / 4476749.


Unnamed: 0,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num
22,11,1335667,0.0,203,6,1,124,0,0,0,3685.0,3809.0,4.0
23,11,1335667,0.0,203,6,1,141,0,0,0,3809.0,3950.0,4.0
24,11,1335667,0.0,203,6,1,162,0,0,0,3950.0,4112.0,4.0
25,11,1335667,0.0,203,6,1,159,0,0,0,4112.0,4271.0,4.0
28,11,1335667,0.0,203,6,1,182,0,0,0,4271.0,4453.0,4.0
1178214,11,170,0.0,207,6,1,332,0,0,0,0.0,332.0,
1178211,11,170,0.0,207,6,1,385,0,0,0,332.0,717.0,
1178200,11,170,0.0,207,6,1,479,0,0,0,717.0,1196.0,
1178209,11,170,0.0,207,6,1,437,0,0,0,1196.0,1633.0,
1178207,11,170,0.0,207,6,1,453,0,0,0,1633.0,2086.0,
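
The positional shift used above can be seen on a shortened toy frame. The column set is reduced for illustration (so the overflow value lands in 'months_num' directly), and everything is cast to float so that NaN can be stored:

```python
import numpy as np
import pandas as pd

cols = ['mtr_status', 'mtr_code', 'mtr_notes', 'mtr_coef', 'usage_1', 'months_num']
df = pd.DataFrame(
    [[0, 203, 6, 1, 124, 4],      # well-formed row
     [769, 0, 207, 6, 1, 332]],   # misaligned row: values sit one column too far right
    columns=cols,
).astype(float)

mask = df['mtr_status'] == 769
# Pull every value one column to the left, overwriting the spurious 'mtr_status'
df.loc[mask, cols[0:5]] = df.loc[mask, cols[1:6]].values
# The last column has nothing left to receive, so it becomes unknown
df.loc[mask, cols[5]] = np.nan
```

Taking `.values` on the right-hand side matters: without it pandas would align on column names and the assignment would not shift anything.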


##### Focusing on 'mtr_coef'
One identified issue is assumed to be caused by 'mtr_coef' having had a decimal place that caused an offset in columns. This was originally deduced from suspicious 'months_num' and 'usage_2' values.

For example, "1,5" would cause the "5" to fall into 'usage_1', etc. 'months_num' was then overwritten and lost.

'mtr_coef' would therefore need to be reconstructed for these examples, and columns shifted. 

The lost 'months_num' are empty for now.

Rows have to be filtered carefully to identify only those related to the described data corruption problem:
* If(0 < 'usage_1' < 10), then flag.
  * On the assumption it is single decimal place, it must be a single digit.
  * All other conditions are applied after this.
* If('usage_2' > 0) or If('months_num' > 240), then flag.
  * The minimum monthly cap seen is 50, so 'usage_2' should not be used prior to that.
  * This would not apply if only true 'usage_1' was used, so is insufficient.
  * 'months_num' would not reasonably exceed 20 years.
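
A short sketch of the assumed mechanism: an unquoted decimal comma yields one field too many, and the coefficient can be rebuilt from the integer part and the stray digit. The raw line is hypothetical; how the original file was actually exported is not known.

```python
import csv

expected_fields = ['mtr_coef', 'usage_1', 'usage_2', 'months_num']
# Hypothetical raw line: mtr_coef "1,5", usage_1 229, usage_2 0, months_num 4
line = '1,5,229,0,4'

fields = next(csv.reader([line]))  # five values arrive for four columns

# Rebuild the coefficient from its integer part and the single stray digit
mtr_coef = int(fields[0]) + int(fields[1]) / 10
repaired = [mtr_coef] + [int(v) for v in fields[2:]]
```

In the dataset the final value was lost instead of kept as an extra field, which is why 'months_num' has to be left empty for these rows.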

In [17]:
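# Finding those affected
# Note: '&' binds tighter than '|', so this evaluates as
# (usage_1 in 1..9 AND usage_2 != 0) OR (months_num > 240).
# If the single-digit 'usage_1' test is meant to gate both conditions, the last two need their own parentheses.
mask = (((df_train['usage_1'] > 0) & (df_train['usage_1'] < 10)) &
        ((df_train['usage_2'] != 0)) | (df_train['months_num'] > 240))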
df_temp = df_train[mask]
col_names = ['mtr_coef', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'mtr_val_old', 'mtr_val_new', 'months_num', 'usage_N', 'usage_n', 'usage_n_calc']
pd.concat([df_train[~mask].head(5), df_train[mask].head(5)])[col_names] # Comparison

Unnamed: 0,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,usage_N,usage_n,usage_n_calc
22,1,124,0,0,0,3685.0,3809.0,4.0,124,124.0,124.0
23,1,141,0,0,0,3809.0,3950.0,4.0,141,141.0,141.0
24,1,162,0,0,0,3950.0,4112.0,4.0,162,162.0,162.0
25,1,159,0,0,0,4112.0,4271.0,4.0,159,159.0,159.0
28,1,182,0,0,0,4271.0,4453.0,4.0,182,182.0,182.0
20214,1,5,1200,3024,0,0.0,495.0,3311.0,4229,4229.0,495.0
20219,1,5,229,0,0,0.0,342.0,495.0,234,234.0,342.0
20212,1,5,1200,10566,0,0.0,9971.0,17815.0,11771,11771.0,9971.0
20213,1,5,1200,8790,0,0.0,3311.0,9971.0,9995,9995.0,3311.0
20211,1,5,1200,10744,0,0.0,17815.0,25778.0,11949,11949.0,17815.0
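
The rule list reads as "single-digit usage_1, and then either usage_2 > 0 or months_num > 240". When combining boolean Series it may be worth checking the grouping, because `&` binds tighter than `|` in Python. A toy comparison of the two readings (names `a`, `b`, `c` are just shorthand):

```python
import pandas as pd

df = pd.DataFrame({
    'usage_1':    [5,   124, 5],
    'usage_2':    [229, 0,   0],
    'months_num': [4,   300, 4],
})

a = (df['usage_1'] > 0) & (df['usage_1'] < 10)
b = df['usage_2'] != 0
c = df['months_num'] > 240

mask_as_parsed = a & b | c      # Python evaluates this as (a & b) | c
mask_as_listed = a & (b | c)    # the single-digit test gates both conditions
```

The second row (large 'months_num' but an ordinary 'usage_1') is selected by the first mask and not by the second.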


In [18]:
# Enact the change, and keep record
df_train.loc[mask, 'col_shift_flag'] = 1
df_train['unk_months_num'] = mask.astype(int)
df_train['mtr_coef'] = df_train['mtr_coef'].astype(float)
df_train.loc[mask, 'mtr_coef'] = df_train.loc[mask, 'mtr_coef'] + (df_train.loc[mask, 'usage_1'].astype(float) / 10).round(1)
df_train.loc[mask, df_train.columns[8:14]] = df_train.loc[mask, df_train.columns[9:15]].values
df_train.loc[mask, df_train.columns[14]] = np.nan
print(f"Number of Rows where Columns seem offset based on ruleset / Total Rows: {sum(mask)} / {len(df_train)}.")
df_train = Calc_Usage(df_train.copy())
df_train[mask][col_names]

Number of Rows where Columns seem offset based on ruleset / Total Rows: 1392 / 4476749.
Rows with unexpected usage values / Total Rows: 16436 / 4476749.


Unnamed: 0,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,usage_N,usage_n,usage_n_calc
20214,1.5,1200,3024,0,0,495.0,3311.0,,4224,2816.0,2816.0
20219,1.5,229,0,0,0,342.0,495.0,,229,153.0,153.0
20212,1.5,1200,10566,0,0,9971.0,17815.0,,11766,7844.0,7844.0
20213,1.5,1200,8790,0,0,3311.0,9971.0,,9990,6660.0,6660.0
20211,1.5,1200,10744,0,0,17815.0,25778.0,,11944,7963.0,7963.0
...,...,...,...,...,...,...,...,...,...,...,...
4457223,1.5,200,100,200,1810,459733.0,461273.0,,2310,1540.0,1540.0
4457217,1.5,1000,407,0,0,465008.0,465946.0,,1407,938.0,938.0
4457218,1.5,200,100,200,239,463554.0,464047.0,,739,493.0,493.0
4457226,1.5,200,100,200,1321,458519.0,459733.0,,1821,1214.0,1214.0


#### Poor Date Parsing
Another identified issue is assumed to have been caused by the original uploader of the dataset to Kaggle relying on automated date parsing, which switched months and days in places. This was deduced by considering the 'invoice_date', 'mtr_val_old', 'mtr_val_new', and 'months_num' fields. The significance is that 'months_num' was calculated from the true 'invoice_date', not the 'invoice_date' recorded here.

Using the data hierarchy mentioned before, it would broadly be expected for 'mtr_val_old' and 'mtr_val_new' to increase monotonically.

In [19]:
# Finding those affected: a day of 12 or less could have been read as a month (the month <= 12 test is always true)
mask = (df_train['invoice_date'].dt.day <= 12) & (df_train['invoice_date'].dt.month <= 12)
print(f"Rows where meter readings day and month could have been switched / Total Rows: {sum(mask)} / {len(df_train)}.") # ~44% of rows
print(f"Rows where meter readings seem out of order / Rows where day and month could have been switched: {sum(((df_train['mtr_flag_bkd'] == 1) | (df_train['mtr_flag_fwd'] == 1)) & mask)} / {sum(mask)}.") # ~51% of these

Rows where meter readings day and month could have been switched / Total Rows: 1976635 / 4476749.
Rows where meter readings seem out of order / Rows where day and month could have been switched: 998929 / 1976635.


In [20]:
# Flip the day and month for these cases, and keep record. There might be a vectorised way for this to speed this up...
df_train['date_flip_flag'] = mask.astype(int)
df_train.loc[mask, 'invoice_date'] = df_train.loc[mask, 'invoice_date'].apply(lambda x: x.replace(day=x.month, month=x.day) if pd.notna(x) else x)
df_train.sort_values(by=grp_cols+['invoice_date', 'mtr_code', 'usage_n'], inplace=True) # Update the sort to reflect new dates, including usage_n henceforth
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows where meter readings seem out of order looking backwards / Total Rows: 11899 / 4476749. Excluded NAs: 233780.
Rows where meter readings seem out of order looking forwards / Total Rows: 11899 / 4476749. Excluded NAs: 233780.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 4480 / 4476749. Excluded NAs: 1426.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 4525 / 4476749. Excluded NAs: 1426.


Given how significantly the data quality metrics improved, this is unlikely to be due to chance. However, some dates may not have needed switching; unfortunately, nothing is done about these.
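
The day/month swap uses a row-wise `apply`, which is slow on millions of rows. One possible vectorised alternative is to rebuild the affected dates from their components with `pd.to_datetime` on a year/month/day frame. A sketch on a toy Series, not a drop-in replacement:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2014-03-04', '2014-12-24', None]))
mask = dates.dt.day <= 12   # NaT compares False, so missing dates are left alone

part = dates[mask]
# Reassemble the selected dates with day and month exchanged
swapped = pd.to_datetime(pd.DataFrame({
    'year': part.dt.year,
    'month': part.dt.day,
    'day': part.dt.month,
}))
dates.loc[mask] = swapped
```

Because the mask only selects days of 12 or less, the swapped month is always valid, and the swapped day (a former month) is at most 12.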

#### "Deep" sorting
There is a complication in that, very rarely, this grouping may contain two different 'mtr_code' values for a given 'invoice_date'. In this case, the rows should be ordered so that each 'mtr_code' matches its neighbouring rows where possible.

Unfortunately, this requires a more expensive grouping / sorting operation, contained in the 'Calc_Deep_Sort()' function below.

In [21]:
def Calc_Deep_Sort(df_train, grp_cols=grp_cols):
    # Note this assumes typical sort has already been done!
    # For each group, dates holding two different 'mtr_code' values are located, and the pair is
    # marked for swapping (via 'temp_flag') when that better matches the neighbouring rows.
    df_temp = []
    for _, group in df_train.groupby(grp_cols,observed=True):
        # Find dates with 2 different values
        date_counts = group.groupby('invoice_date')['mtr_code'].nunique()
        dates_with_pairs = date_counts[date_counts == 2].index
        if len(dates_with_pairs) > 0:
            for date in dates_with_pairs:
                # Get the pair of rows for this date
                mask = (group['invoice_date'] == date)
                pair_rows = group[mask]
                # Get values before and after
                before_value = group[group['invoice_date'] < date].iloc[-1]['mtr_code'] if any(group['invoice_date'] < date) else None
                after_value = group[group['invoice_date'] > date].iloc[0]['mtr_code'] if any(group['invoice_date'] > date) else None
                # Score each order based on matches with neighbors
                score_order = sum([1 if before_value == pair_rows.iloc[0]['mtr_code'] else 0,
                                   1 if after_value == pair_rows.iloc[1]['mtr_code'] else 0,
                                   -1 if before_value == pair_rows.iloc[1]['mtr_code'] else 0,
                                   -1 if after_value == pair_rows.iloc[0]['mtr_code'] else 0])
                # Flip if needed
                if score_order < 0:
                    #pair_rows.index = pair_rows.index[::-1]
                    #group.loc[mask] = pair_rows.iloc[::-1].values
                    #group.iloc[pair_rows.index] = pair_rows.iloc[pair_rows.index[::-1]].values
                    group.loc[mask, 'temp_flag'] = [2, 1]
        df_temp.append(group)
    return pd.concat(df_temp)
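
The scoring step inside 'Calc_Deep_Sort()' can be isolated as a small pure function, which makes the flip rule easier to test. `should_flip` is a hypothetical helper and the codes below are illustrative values:

```python
def should_flip(before, first, second, after):
    """Return True if swapping the same-date pair better matches its neighbours."""
    score = 0
    score += 1 if before == first else 0    # current order agrees with the row before
    score += 1 if after == second else 0    # current order agrees with the row after
    score -= 1 if before == second else 0   # swapped order would agree with the row before
    score -= 1 if after == first else 0     # swapped order would agree with the row after
    return score < 0


# Pair (207, 413) sitting between a 413 before and a 207 after: swapping fits better
flip_needed = should_flip(413, 207, 413, 207)
# Pair already in the matching order
flip_not_needed = should_flip(413, 413, 207, 207)
# No neighbours on either side: nothing to go on, so leave as is
flip_unknown = should_flip(None, 207, 413, None)
```

A tie (score of zero) leaves the existing order untouched, matching the `score_order < 0` test in the function above.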

In [57]:
#df_train_safe = df_train.copy()
df_train = df_train_safe.copy()

In [22]:
df_train['temp_flag'] = 0
df_train = df_train.reset_index()
df_train['idx'] = df_train['index']
df_train

Unnamed: 0,index,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,...,months_num_calc,months_gap_calc,mtr_flag_bkd,mtr_flag_fwd,date_flag,col_shift_flag,unk_months_num,date_flip_flag,temp_flag,idx
0,22,train_Client_0,2005-10-17,11,1335667,0,203,6,1.0,124,...,,,0,0,0,0,0,0,0,22
1,23,train_Client_0,2006-02-24,11,1335667,0,203,6,1.0,141,...,5.0,1.0,0,0,0,0,0,0,0,23
2,24,train_Client_0,2006-06-23,11,1335667,0,203,6,1.0,162,...,4.0,0.0,0,0,0,0,0,0,0,24
3,25,train_Client_0,2006-10-18,11,1335667,0,203,6,1.0,159,...,4.0,0.0,0,0,0,0,0,0,0,25
4,28,train_Client_0,2007-02-26,11,1335667,0,203,6,1.0,182,...,5.0,1.0,0,0,0,0,0,0,0,28
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4476744,4476744,train_Client_99998,2005-08-19,10,1253571,0,202,9,1.0,400,...,,,0,0,0,0,0,0,0,4476744
4476745,4476745,train_Client_99998,2005-12-19,10,1253571,0,202,6,1.0,200,...,4.0,0.0,0,0,0,0,0,0,0,4476745
4476746,4476748,train_Client_99999,1996-01-25,11,560948,0,203,6,1.0,516,...,,,0,0,0,0,0,0,0,4476748
4476747,4476747,train_Client_99999,1996-05-28,11,560948,0,203,6,1.0,603,...,5.0,1.0,0,0,0,0,0,0,0,4476747


In [23]:
df_train['temp_flag'] = df_train.groupby(grp_cols + ['invoice_date'], observed=True)[['mtr_code']].transform('nunique') > 1
mask = df_train['temp_flag'] == True

In [24]:
df_train['temp_flag'] = np.nan
df_train.loc[mask, 'temp_flag'] = df_train.groupby(grp_cols, observed=True)['mtr_code'].shift(1) #prev
df_train['temp_flag_2'] = np.nan
df_train.loc[mask, 'temp_flag_2'] = (df_train.loc[mask, 'mtr_code'] != df_train.loc[mask, 'temp_flag']).astype(int)

In [25]:
df_train['temp_flag_2'] = df_train.groupby(grp_cols + ['invoice_date'], observed=True)[['temp_flag_2']].transform('max') * -1
df_train['temp_flag_2'] = df_train.groupby(grp_cols + ['invoice_date'], observed=True)[['temp_flag_2']].transform('cumsum')



In [26]:
df_train.sort_values(by=grp_cols+['invoice_date', 'temp_flag_2', 'usage_n'], inplace=True)

In [None]:
df_train['temp_flag_2'] = df_train.groupby(grp_cols + ['invoice_date'], observed=True)[['temp_flag_2']].apply(lambda x: x[::-1].cumsum()[::-1]).to_list()

In [59]:
df_temp = df_train.groupby(grp_cols + ['invoice_date', 'temp_flag'],observed=True)[['mtr_code']].nunique().reset_index()
mask = (df_temp['mtr_code'] > 1)
df_temp[mask]

Unnamed: 0,client_id,mtr_type,mtr_tariff,mtr_id,invoice_date,temp_flag,mtr_code
1812,train_Client_10005,ELEC,11,777401,2014-03-04,0,2
10339,train_Client_100283,ELEC,11,280,2016-04-07,0,2
26940,train_Client_100747,ELEC,11,236664,2008-02-08,0,2
34488,train_Client_100941,ELEC,11,516212,2011-09-29,0,2
39507,train_Client_101062,ELEC,11,538072,2010-12-03,0,2
...,...,...,...,...,...,...,...
4369774,train_Client_97084,GAZ,40,2514044,2016-02-12,0,2
4391569,train_Client_97708,ELEC,11,618717,2014-02-03,0,2
4405065,train_Client_98066,ELEC,11,8414,2011-10-05,0,2
4407353,train_Client_98130,ELEC,11,57283,2014-12-24,0,2


In [61]:
df_train[df_train[grp_cols + ['invoice_date', 'temp_flag']].apply(tuple, axis=1).isin(df_temp[mask][grp_cols + ['invoice_date', 'temp_flag']].apply(tuple, axis=1))]

Unnamed: 0,index,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,...,months_num_calc,months_gap_calc,mtr_flag_bkd,mtr_flag_fwd,date_flag,col_shift_flag,unk_months_num,date_flip_flag,temp_flag,idx
1812,1812,train_Client_10005,2014-03-04,11,777401,0,207,9,1.0,225,...,4.0,2.0,0,1,0,0,0,1,0,1812
1813,1811,train_Client_10005,2014-03-04,11,777401,5,413,6,1.0,0,...,0.0,-2.0,1,0,1,0,0,1,0,1811
10341,10347,train_Client_100283,2016-04-07,11,280,5,207,6,1.0,0,...,,,0,0,0,0,0,1,0,10347
10342,10346,train_Client_100283,2016-04-07,11,280,0,413,9,1.0,400,...,0.0,-2.0,0,0,1,0,0,1,0,10346
26943,26942,train_Client_100747,2008-02-08,11,236664,5,203,6,1.0,0,...,5.0,3.0,0,1,0,0,0,1,0,26942
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4405606,4405608,train_Client_98066,2011-10-05,11,8414,5,413,6,1.0,355,...,0.0,-2.0,0,0,1,0,0,1,0,4405608
4407894,4407893,train_Client_98130,2014-12-24,11,57283,5,203,6,1.0,0,...,4.0,2.0,0,0,0,0,0,0,0,4407893
4407895,4407892,train_Client_98130,2014-12-24,11,57283,0,413,9,1.0,400,...,0.0,-2.0,0,0,1,0,0,0,0,4407892
4408865,4408887,train_Client_98157,2010-10-25,11,9819985,5,203,6,1.0,0,...,11.0,3.0,0,0,0,0,0,0,0,4408887


In [53]:
df_temp = df_train.merge(df_temp.loc[mask, grp_cols+['temp_flag']], on=grp_cols+['temp_flag'], how='inner').copy().set_index('index')

In [54]:
df_temp[df_temp['client_id'] == 'train_Client_10005'][['invoice_date', 'mtr_code', 'idx', 'temp_flag']]

index,invoice_date,mtr_code,idx,temp_flag
1798,2009-05-05,413,1798,0
1807,2009-08-27,413,1807,0
1826,2010-03-10,413,1826,0
1803,2010-07-05,413,1803,0
1825,2010-11-02,413,1825,0
1824,2011-03-11,413,1824,0
1823,2011-07-06,413,1823,0
1822,2011-11-03,413,1822,0
1800,2012-03-09,413,1800,0
1821,2012-07-06,413,1821,0


In [55]:
df_temp = Calc_Deep_Sort(df_temp.copy(), grp_cols)
df_temp[df_temp['client_id'] == 'train_Client_10005']

Index(['client_id', 'invoice_date', 'mtr_tariff', 'mtr_id', 'mtr_status',
       'mtr_code', 'mtr_notes', 'mtr_coef', 'usage_1', 'usage_2', 'usage_3',
       'usage_4', 'mtr_val_old', 'mtr_val_new', 'months_num', 'mtr_type',
       'district', 'client_type', 'region', 'start_date', 'fraud', 'usage_N',
       'usage_n', 'monthly_usage', 'usage_n_calc', 'mtr_coef_calc',
       'mtr_val_new_calc', 'mtr_val_old_calc', 'usage_flag',
       'mtr_val_new_prv_2', 'mtr_val_old_prv', 'mtr_val_new_prv',
       'invoice_date_prv', 'months_num_prv', 'monthly_usage_prv',
       'mtr_val_old_nxt', 'mtr_val_new_nxt', 'mtr_val_old_nxt_2',
       'invoice_date_prv_calc', 'months_num_calc', 'months_gap_calc',
       'mtr_flag_bkd', 'mtr_flag_fwd', 'date_flag', 'col_shift_flag',
       'unk_months_num', 'date_flip_flag', 'temp_flag', 'idx'],
      dtype='object')


index,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,months_num_calc,months_gap_calc,mtr_flag_bkd,mtr_flag_fwd,date_flag,col_shift_flag,unk_months_num,date_flip_flag,temp_flag,idx
1798,train_Client_10005,2009-05-05,11,777401,1,413,6,1.0,5,0,...,,,0,0,0,0,0,1,0,1798
1807,train_Client_10005,2009-08-27,11,777401,0,413,6,1.0,3,0,...,4.0,0.0,0,0,0,0,0,0,0,1807
1826,train_Client_10005,2010-03-10,11,777401,0,413,6,1.0,19,0,...,7.0,3.0,0,0,0,0,0,1,0,1826
1803,train_Client_10005,2010-07-05,11,777401,0,413,6,1.0,142,0,...,4.0,0.0,0,0,0,0,0,1,0,1803
1825,train_Client_10005,2010-11-02,11,777401,0,413,6,1.0,370,0,...,4.0,0.0,0,0,0,0,0,1,0,1825
1824,train_Client_10005,2011-03-11,11,777401,0,413,9,1.0,407,0,...,5.0,1.0,0,0,0,0,0,1,0,1824
1823,train_Client_10005,2011-07-06,11,777401,0,413,8,1.0,12,0,...,4.0,2.0,0,0,0,0,0,1,0,1823
1822,train_Client_10005,2011-11-03,11,777401,0,413,6,1.0,0,0,...,4.0,0.0,0,0,0,0,0,1,0,1822
1800,train_Client_10005,2012-03-09,11,777401,0,413,8,1.0,0,0,...,5.0,1.0,0,0,0,0,0,1,0,1800
1821,train_Client_10005,2012-07-06,11,777401,0,413,8,1.0,0,0,...,4.0,0.0,0,0,0,0,0,1,0,1821


In [None]:
df_temp = df_train.merge(df_temp[grp_cols], on=grp_cols, how='inner').copy().set_index('index')

In [128]:
df_temp = df_temp.set_index('idx')
df_temp[df_temp['client_id'] == 'train_Client_10005']

idx,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,mtr_val_old_nxt_2,invoice_date_prv_calc,months_num_calc,months_gap_calc,mtr_flag_bkd,mtr_flag_fwd,date_flag,col_shift_flag,unk_months_num,date_flip_flag
1798,train_Client_10005,2009-05-05,11,777401,1,413,6,1.0,5,0,...,9.0,2009-03-05,,,0,0,0,0,0,1
1807,train_Client_10005,2009-08-27,11,777401,0,413,6,1.0,3,0,...,28.0,2009-04-27,4.0,0.0,0,0,0,0,0,0
1826,train_Client_10005,2010-03-10,11,777401,0,413,6,1.0,19,0,...,170.0,2009-11-08,7.0,3.0,0,0,0,0,0,1
1803,train_Client_10005,2010-07-05,11,777401,0,413,6,1.0,142,0,...,540.0,2010-03-05,4.0,0.0,0,0,0,0,0,1
1825,train_Client_10005,2010-11-02,11,777401,0,413,6,1.0,370,0,...,1092.0,2010-07-03,4.0,0.0,0,0,0,0,0,1
1824,train_Client_10005,2011-03-11,11,777401,0,413,9,1.0,407,0,...,1104.0,2010-11-09,5.0,1.0,0,0,0,0,0,1
1823,train_Client_10005,2011-07-06,11,777401,0,413,8,1.0,12,0,...,1104.0,2011-05-06,4.0,2.0,0,0,0,0,0,1
1822,train_Client_10005,2011-11-03,11,777401,0,413,6,1.0,0,0,...,1104.0,2011-07-04,4.0,0.0,0,0,0,0,0,1
1800,train_Client_10005,2012-03-09,11,777401,0,413,8,1.0,0,0,...,1104.0,2011-11-08,5.0,1.0,0,0,0,0,0,1
1821,train_Client_10005,2012-07-06,11,777401,0,413,8,1.0,0,0,...,1108.0,2012-03-06,4.0,0.0,0,0,0,0,0,1


In [144]:
df_temp

idx,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,mtr_val_old_nxt_2,invoice_date_prv_calc,months_num_calc,months_gap_calc,mtr_flag_bkd,mtr_flag_fwd,date_flag,col_shift_flag,unk_months_num,date_flip_flag
1798,train_Client_10005,2009-05-05,11,777401,1,413,6,1.0,5,0,...,9.0,2009-03-05,,,0,0,0,0,0,1
1807,train_Client_10005,2009-08-27,11,777401,0,413,6,1.0,3,0,...,28.0,2009-04-27,4.0,0.0,0,0,0,0,0,0
1826,train_Client_10005,2010-03-10,11,777401,0,413,6,1.0,19,0,...,170.0,2009-11-08,7.0,3.0,0,0,0,0,0,1
1803,train_Client_10005,2010-07-05,11,777401,0,413,6,1.0,142,0,...,540.0,2010-03-05,4.0,0.0,0,0,0,0,0,1
1825,train_Client_10005,2010-11-02,11,777401,0,413,6,1.0,370,0,...,1092.0,2010-07-03,4.0,0.0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4408913,train_Client_98157,2014-10-23,11,9819985,0,207,8,1.0,85,0,...,51294.0,2014-08-23,5.0,3.0,0,0,0,0,0,0
4408910,train_Client_98157,2015-06-23,11,9819985,0,207,8,1.0,16,0,...,51649.0,2015-04-23,8.0,6.0,0,0,0,0,0,0
4408855,train_Client_98157,2016-06-21,11,9819985,0,207,8,1.0,355,0,...,53059.0,2016-02-20,12.0,8.0,0,0,0,0,0,0
4408858,train_Client_98157,2017-02-27,11,9819985,0,207,9,1.0,1200,210,...,,2016-08-28,9.0,3.0,0,0,0,0,0,0


In [141]:
mask = df_temp.index.values
mask

array([   1798,    1807,    1826, ..., 4408855, 4408858, 4408860])

In [139]:
df_train.iloc[df_train.loc[df_train['idx'].isin(mask)].index]

Unnamed: 0,index,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,...,invoice_date_prv_calc,months_num_calc,months_gap_calc,mtr_flag_bkd,mtr_flag_fwd,date_flag,col_shift_flag,unk_months_num,date_flip_flag,idx
1798,1798,train_Client_10005,2009-05-05,11,777401,1,413,6,1.0,5,...,2009-03-05,,,0,0,0,0,0,1,1798
1799,1807,train_Client_10005,2009-08-27,11,777401,0,413,6,1.0,3,...,2009-04-27,4.0,0.0,0,0,0,0,0,0,1807
1800,1826,train_Client_10005,2010-03-10,11,777401,0,413,6,1.0,19,...,2009-11-08,7.0,3.0,0,0,0,0,0,1,1826
1801,1803,train_Client_10005,2010-07-05,11,777401,0,413,6,1.0,142,...,2010-03-05,4.0,0.0,0,0,0,0,0,1,1803
1802,1825,train_Client_10005,2010-11-02,11,777401,0,413,6,1.0,370,...,2010-07-03,4.0,0.0,0,0,0,0,0,1,1825
1803,1824,train_Client_10005,2011-03-11,11,777401,0,413,9,1.0,407,...,2010-11-09,5.0,1.0,0,0,0,0,0,1,1824
1804,1823,train_Client_10005,2011-07-06,11,777401,0,413,8,1.0,12,...,2011-05-06,4.0,2.0,0,0,0,0,0,1,1823
1805,1822,train_Client_10005,2011-11-03,11,777401,0,413,6,1.0,0,...,2011-07-04,4.0,0.0,0,0,0,0,0,1,1822
1806,1800,train_Client_10005,2012-03-09,11,777401,0,413,8,1.0,0,...,2011-11-08,5.0,1.0,0,0,0,0,0,1,1800
1807,1821,train_Client_10005,2012-07-06,11,777401,0,413,8,1.0,0,...,2012-03-06,4.0,0.0,0,0,0,0,0,1,1821


In [140]:
df_train.loc[df_train['idx'].isin(mask)]

Unnamed: 0,index,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,...,invoice_date_prv_calc,months_num_calc,months_gap_calc,mtr_flag_bkd,mtr_flag_fwd,date_flag,col_shift_flag,unk_months_num,date_flip_flag,idx
1798,1798,train_Client_10005,2009-05-05,11,777401,1,413,6,1.0,5,...,2009-03-05,,,0,0,0,0,0,1,1798
1799,1807,train_Client_10005,2009-08-27,11,777401,0,413,6,1.0,3,...,2009-04-27,4.0,0.0,0,0,0,0,0,0,1807
1800,1826,train_Client_10005,2010-03-10,11,777401,0,413,6,1.0,19,...,2009-11-08,7.0,3.0,0,0,0,0,0,1,1826
1801,1803,train_Client_10005,2010-07-05,11,777401,0,413,6,1.0,142,...,2010-03-05,4.0,0.0,0,0,0,0,0,1,1803
1802,1825,train_Client_10005,2010-11-02,11,777401,0,413,6,1.0,370,...,2010-07-03,4.0,0.0,0,0,0,0,0,1,1825
1803,1824,train_Client_10005,2011-03-11,11,777401,0,413,9,1.0,407,...,2010-11-09,5.0,1.0,0,0,0,0,0,1,1824
1804,1823,train_Client_10005,2011-07-06,11,777401,0,413,8,1.0,12,...,2011-05-06,4.0,2.0,0,0,0,0,0,1,1823
1805,1822,train_Client_10005,2011-11-03,11,777401,0,413,6,1.0,0,...,2011-07-04,4.0,0.0,0,0,0,0,0,1,1822
1806,1800,train_Client_10005,2012-03-09,11,777401,0,413,8,1.0,0,...,2011-11-08,5.0,1.0,0,0,0,0,0,1,1800
1807,1821,train_Client_10005,2012-07-06,11,777401,0,413,8,1.0,0,...,2012-03-06,4.0,0.0,0,0,0,0,0,1,1821


In [134]:
df_train.iloc[mask]

Unnamed: 0,index,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,...,invoice_date_prv_calc,months_num_calc,months_gap_calc,mtr_flag_bkd,mtr_flag_fwd,date_flag,col_shift_flag,unk_months_num,date_flip_flag,idx
1798,1798,train_Client_10005,2009-05-05,11,777401,1,413,6,1.0,5,...,2009-03-05,,,0,0,0,0,0,1,1798
1807,1821,train_Client_10005,2012-07-06,11,777401,0,413,8,1.0,0,...,2012-03-06,4.0,0.0,0,0,0,0,0,1,1821
1826,1808,train_Client_10005,2018-07-03,11,777401,0,207,9,1.0,333,...,2018-03-03,4.0,0.0,0,0,0,0,0,1,1808
1803,1824,train_Client_10005,2011-03-11,11,777401,0,413,9,1.0,407,...,2010-11-09,5.0,1.0,0,0,0,0,0,1,1824
1825,1815,train_Client_10005,2018-03-07,11,777401,0,207,9,1.0,368,...,2017-11-05,4.0,0.0,0,0,0,0,0,1,1815
1824,1801,train_Client_10005,2017-11-07,11,777401,0,207,9,1.0,465,...,2017-07-08,5.0,1.0,0,0,0,0,0,1,1801
1823,1809,train_Client_10005,2017-07-06,11,777401,0,207,9,1.0,301,...,2017-03-06,4.0,0.0,0,0,0,0,0,1,1809
1822,1810,train_Client_10005,2017-03-07,11,777401,0,207,9,1.0,252,...,2016-11-05,4.0,0.0,0,0,0,0,0,1,1810
1800,1826,train_Client_10005,2010-03-10,11,777401,0,413,6,1.0,19,...,2009-11-08,7.0,3.0,0,0,0,0,0,1,1826
1821,1799,train_Client_10005,2016-11-08,11,777401,0,207,9,1.0,255,...,2016-07-09,5.0,1.0,0,0,0,0,0,1,1799


In [85]:
df_temp = df_temp.set_index('index')

In [None]:
def simple_stable_sort_pairs(df, group_cols, date_col='invoice_date', value_col='mtr_code'):
    # First sort by group columns and date
    df_sorted = df.sort_values(group_cols + [date_col])
    # For each group
    result_dfs = []
    for _, group in df_sorted.groupby(group_cols):
        # Find dates with exactly 2 different values
        date_counts = group.groupby(date_col)[value_col].nunique()
        dates_with_pairs = date_counts[date_counts == 2].index
        if len(dates_with_pairs) > 0:
            for date in dates_with_pairs:
                # Get the pair of rows for this date
                mask = (group[date_col] == date)
                pair_rows = group[mask]
                # Distinct values in the pair (can only be read once pair_rows is defined)
                values = pair_rows[value_col].unique()
                # Get values before and after
                before_value = group[group[date_col] < date].iloc[-1][value_col] if any(group[date_col] < date) else None
                after_value = group[group[date_col] > date].iloc[0][value_col] if any(group[date_col] > date) else None
                # Score each order based on matches with neighbors
                score_order = sum([1 if before_value == values[0] else 0,
                                   1 if after_value == values[1] else 0,
                                   -1 if before_value == values[1] else 0,
                                   -1 if after_value == values[0] else 0])
                # Flip if needed
                if score_order < 0:
                    pair_rows.index = pair_rows.index[::-1]
                    group.loc[mask] = pair_rows.copy()
        result_dfs.append(group)
    return pd.concat(result_dfs)

### Data Coherence

#### "Empty" Rows
Some rows are essentially empty: they have no recorded usage and no meter values, or similar. These mostly seem out of place and provide little value. They will be removed unless doing so would leave no row for a given 'client_id' and 'mtr_type'.

The 'Remove_Rows()' function handles this task.

In [21]:
def Remove_Rows(df_train, mask, reason, df_removed=None):
    df_train['temp_flag'] = mask
    # Must have at least one record kept for the group
    mask = mask & (df_train.groupby(['client_id', 'mtr_type'], observed=True)['temp_flag'].transform('min') == 0)
    df_temp = df_train[mask].copy()
    df_temp['removed'] = reason
    if df_removed is not None:
        df_removed = pd.concat([df_removed, df_temp.copy()], sort=False)
    else:
        df_removed = df_temp.copy()
    print(f"Rows to be removed / out of rows requested to be removed: {len(df_temp)} / {sum(df_train['temp_flag'])}.")
    df_train = df_train[~mask]
    return (df_train.copy(), df_removed.copy(), df_temp)
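
The guard inside 'Remove_Rows()' can be illustrated on toy data. This is a minimal sketch that reproduces only the "keep at least one row per group" check; the data and the variable names ('requested', 'group_has_keeper', 'to_remove') are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy data: client A has one flagged and one unflagged row,
# client B has only flagged rows.
df = pd.DataFrame({
    'client_id': ['A', 'A', 'B', 'B'],
    'mtr_type': ['ELEC', 'ELEC', 'ELEC', 'ELEC'],
    'usage_N': [0, 10, 0, 0],
})
requested = df['usage_N'] == 0  # rows we would like to remove

# A group may lose rows only if at least one of its rows is not flagged,
# i.e. the group-wise minimum of the flag is 0.
flag = requested.astype(int)
group_has_keeper = flag.groupby([df['client_id'], df['mtr_type']]).transform('min') == 0
to_remove = requested & group_has_keeper

kept = df[~to_remove]
print(f"Removed {to_remove.sum()} of {requested.sum()} requested rows.")
```

Note that when every row of a group is flagged, the guard keeps all of that group's rows rather than only one of them, which is why the printed "removed" count can be lower than the "requested" count.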

##### Focussing on 'mtr_coef':
A 'mtr_coef' of 0 is rare, and any energy usage in such a row does not carry over to the following rows. These rows mostly have the equally rare 'mtr_tariff' of 8. It was checked that they are unrelated to the previous 'mtr_coef' issue, i.e., they were not 0.6 etc.

In [22]:
mask = df_train['mtr_coef'] == 0 # Manual check showed these do not contribute to energy usage over time. They reset each row.
df_train['bad_coef'] = mask.astype(int)
df_train, df_removed, df_temp = Remove_Rows(df_train, mask, 'mtr_empty')
df_temp[col_names].head()

Rows to be removed / out of rows requested to be removed: 45 / 45.


Unnamed: 0,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,usage_N,usage_n,usage_n_calc
179746,0.0,0,0,0,0,0.0,87.0,4.0,0,,87.0
179751,0.0,0,0,0,0,0.0,0.0,4.0,0,,0.0
179755,0.0,0,0,0,0,0.0,633.0,4.0,0,,633.0
997145,0.0,0,0,0,0,0.0,0.0,1.0,0,,0.0
1228208,0.0,0,0,0,0,0.0,0.0,4.0,0,,0.0


##### Focusing on ['usage_{1|2|3|4}', 'mtr_val_{old|new}']: "Empty" Meter
There are some rows where 'usage_N' == 'mtr_val_old' == 'mtr_val_new' == 0.

There are also some rows where 'usage_N' == 0 and 'mtr_val_old' == 'mtr_val_new' != 0. These will not be removed, but will be treated with lower priority if in conflict with another row. This is done by sorting on 'usage_n' such that, all else being equal, they appear before a row with 'usage_n' > 0 and so are enveloped by it.

In [23]:
mask = (df_train[['usage_N', 'mtr_val_old', 'mtr_val_new']] == 0).all(axis=1) # These are considered empty
df_train, df_removed, df_temp = Remove_Rows(df_train, mask, 'mtr_empty', df_removed)
df_temp[col_names].head()
df_train = Calc_Usage(df_train.copy())
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 103849 / 195401.
Rows with unexpected usage values / Total Rows: 16391 / 4372855.
Rows where meter readings seem out of order looking backwards / Total Rows: 11347 / 4372855.
Rows where meter readings seem out of order looking forwards / Total Rows: 11347 / 4372855.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 3720 / 4372855. Excluded NAs: 1426.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 3749 / 4372855. Excluded NAs: 1426.


#### Intra-Row: Mis-matched Usage
##### "Stuck" Meter:
There seems to be an issue with 'mtr_val_{old|new}' not reflecting 'usage_n'. The cause is not clear; it is assumed to be some sort of stuck meter.
These will be changed if one of two checks is met:
  1. Recalculating 'mtr_val_new' using 'mtr_val_old' and 'usage_n' aligns with the next row.
  2. Recalculating 'mtr_val_old' using 'mtr_val_new' and 'usage_n' aligns with the previous row.

Done sequentially. This is not ideal as some cases will be missed.
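
The first check can be sketched on a toy meter history. This is an illustration only: it assumes a single meter already sorted by date, and that 'mtr_val_new_calc' is 'mtr_val_old' + 'usage_n' (which matches the sample outputs, but 'Calc_Usage()' is not reproduced here).

```python
import pandas as pd

# One meter, rows already in date order.
df = pd.DataFrame({
    'mtr_val_old': [100.0, 150.0, 200.0, 200.0],
    'mtr_val_new': [150.0, 150.0, 200.0, 260.0],
    'usage_n':     [50.0, 50.0, 30.0, 60.0],
})

# Neighbour and recalculated values (assumed definitions).
df['mtr_val_old_nxt'] = df['mtr_val_old'].shift(-1)
df['mtr_val_new_calc'] = df['mtr_val_old'] + df['usage_n']

# Meter did not move although usage was recorded.
stuck = (df['mtr_val_old'] == df['mtr_val_new']) & (df['usage_n'] > 0)
# Only trust the recalculated reading if it lines up with the next row.
fixable = stuck & (df['mtr_val_new_calc'] == df['mtr_val_old_nxt'])

df.loc[fixable, 'mtr_val_new'] = df.loc[fixable, 'mtr_val_new_calc']
```

In this toy case the second row is repaired (its recalculated reading of 200 matches the next row's old reading), while the third row stays flagged as stuck because 230 does not match anything.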

In [24]:
# Checking with next row:
mask = ((df_train['mtr_val_old'] == df_train['mtr_val_new']) & (df_train['usage_n'] > 0)) # Meter not changing despite usage
mask_2 = (df_train['mtr_val_old_nxt'] != df_train['mtr_val_new_calc']) | (df_train['mtr_val_old_nxt'].isna()) # Don't trust adjustment
df_train['mtr_stuck'] = 0
df_train.loc[mask & mask_2, 'mtr_stuck'] = -1 # Those that wouldn't be fixed via adjustment
df_train.loc[mask & ~mask_2, 'mtr_stuck'] = 1 # Those being adjusted
df_train.loc[mask & ~mask_2, 'mtr_val_new'] = df_train.loc[mask & ~mask_2, 'mtr_val_new_calc'] # Replacing
print(f"Looking at Next Row: Number of Rows detected: {sum(mask)}.")
print(f"Looking at Next Row: Number of Rows deemed unfixable: {sum(mask & mask_2)}.")
print(f"Looking at Next Row: Number of Rows deemed fixable: {sum(mask & ~mask_2)}.")

# Checking with previous row:
mask = ((df_train['mtr_val_old'] == df_train['mtr_val_new']) & (df_train['usage_n'] > 0)) # Meter not changing despite usage
mask_2 = (df_train['mtr_val_new_prv'] != df_train['mtr_val_old_calc']) | (df_train['mtr_val_new_prv'].isna()) # Don't trust adjustment
df_train.loc[mask & ~mask_2, 'mtr_stuck'] = 1 # Those being adjusted
df_train.loc[mask & ~mask_2, 'mtr_val_old'] = df_train.loc[mask & ~mask_2, 'mtr_val_old_calc'] # Replacing 'mtr_val_old' so it aligns with the previous row (check 2)
df_train.loc[(df_train['mtr_stuck'] != 1) & (mask & mask_2), 'mtr_stuck'] = -1 # Those that wouldn't be fixed via adjustment
print(f"Looking at Previous Row: Number of Rows detected: {sum(mask)}.")
print(f"Looking at Previous Row: Number of Rows deemed unfixable: {sum(mask & mask_2)}.")
print(f"Looking at Previous Row: Number of Rows deemed fixable: {sum(mask & ~mask_2)}.")

# Re-sort and check for the same rule again
df_train = Calc_Usage(df_train.copy())
df_train = Calc_Neighbours(df_train.copy(), grp_cols)
df_temp = df_train[df_train['mtr_stuck'] == -1]
col_names = ['mtr_coef', 'usage_N', 'usage_n', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'months_num', 'mtr_val_new_calc']
df_temp[col_names].head()

Looking at Next Row: Number of Rows detected: 7281.
Looking at Next Row: Number of Rows deemed unfixable: 4621.
Looking at Next Row: Number of Rows deemed fixable: 2660.
Looking at Previous Row: Number of Rows detected: 4621.
Looking at Previous Row: Number of Rows deemed unfixable: 4611.
Looking at Previous Row: Number of Rows deemed fixable: 10.
Rows with unexpected usage values / Total Rows: 13721 / 4372855.
Rows where meter readings seem out of order looking backwards / Total Rows: 11348 / 4372855.
Rows where meter readings seem out of order looking forwards / Total Rows: 11348 / 4372855.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 3720 / 4372855. Excluded NAs: 1426.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 3749 / 4372855. Excluded NAs: 1426.


Unnamed: 0,mtr_coef,usage_N,usage_n,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,months_num,mtr_val_new_calc
1865,1.0,2943,2943.0,7709.0,7709.0,7709.0,7709.0,8.0,10652.0
2865,1.0,5000,5000.0,3143.0,3143.0,3143.0,3143.0,18.0,8143.0
3004,1.0,204,204.0,1469.0,1469.0,1469.0,1469.0,12.0,1673.0
6864,1.0,20,20.0,157.0,157.0,157.0,157.0,4.0,177.0
6918,1.0,44,44.0,15678.0,15702.0,15702.0,15810.0,2.0,15746.0


##### "Phantom" Usage
An almost reverse of the above problem is when there is no apparent usage (usage_N == 0), but 'mtr_val_old' != 'mtr_val_new'.

If these rows are in the expected order in terms of dates and meter values (looking forwards), these instances are fixed. This is done very simply by placing the whole difference into 'usage_1'. Otherwise, they are left alone for now.

Often, it is the entire history of that 'mtr_tariff' & 'mtr_id' that has the issue. But sometimes, it is sporadic.
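
A small sketch of the fix on toy rows follows. It assumes 'usage_n_calc' is simply 'mtr_val_new' - 'mtr_val_old' and takes the order flags ('date_flag', 'mtr_flag_fwd') as given inputs; in the notebook these come from 'Calc_Usage()' and 'Calc_Neighbours()', which are not reproduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'mtr_coef':     [1.0, 1.0, 1.0],
    'usage_1':      [0.0, 40.0, 0.0],
    'usage_N':      [0.0, 40.0, 0.0],
    'mtr_val_old':  [100.0, 130.0, 170.0],
    'mtr_val_new':  [130.0, 170.0, 150.0],
    'date_flag':    [0, 0, 0],   # 1 would mean overlapping invoice spans
    'mtr_flag_fwd': [0, 0, 1],   # 1 means meter out of order looking forwards
})
df['usage_n_calc'] = df['mtr_val_new'] - df['mtr_val_old']  # assumed definition

# Meter moved, but no usage was recorded.
phantom = (df['mtr_val_old'] != df['mtr_val_new']) & (df['usage_N'] == 0)
# Only adjust rows that are otherwise in the expected order.
in_order = (df['date_flag'] == 0) & (df['mtr_flag_fwd'] == 0)
adjust = phantom & in_order

# Put the whole meter difference into 'usage_1'.
df.loc[adjust, 'usage_1'] = df.loc[adjust, 'usage_n_calc'] * df.loc[adjust, 'mtr_coef']
```

The first row is adjusted (30 units go into 'usage_1'); the last row also has phantom usage but is out of order, so it is left alone.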

In [25]:
mask = (df_train['mtr_val_old'] != df_train['mtr_val_new']) & (df_train['usage_N'] == 0) # those with the missing usage
mask_2 = (df_train['date_flag'] == 0) & (df_train['mtr_flag_fwd'] == 0) 
df_train['usage_missing'] = (mask & mask_2).astype(int) # Those being adjusted
df_train.loc[(mask & ~mask_2), 'usage_missing'] = -1 # Not being adjusted
df_train.loc[(mask & mask_2), 'usage_1'] = df_train.loc[(mask & mask_2), 'usage_n_calc'] * df_train.loc[(mask & mask_2), 'mtr_coef'] # Want unscaled value
print(f"Rows where meter changes whilst no usage is recorded, those adjusted / All possibly affected : {sum(mask & mask_2)} / {sum(mask)}.")
df_train = Calc_Usage(df_train.copy())
df_temp = df_train[(mask & ~mask_2)]
col_names = ['mtr_coef', 'usage_N', 'usage_n', 'usage_n_calc', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'months_num']
df_temp[col_names].head()

Rows where meter changes whilst no usage is recorded, those adjusted / All possibly affected : 2126 / 2154.
Rows with unexpected usage values / Total Rows: 11595 / 4372855.


Unnamed: 0,mtr_coef,usage_N,usage_n,usage_n_calc,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,months_num
129959,1.0,0,0.0,45.0,,0.0,45.0,0.0,2.0
240538,1.0,0,0.0,496.0,1192.0,1647.0,2143.0,1647.0,2.0
258541,1.0,0,0.0,7.0,,5194.0,5201.0,5194.0,2.0
806909,1.0,0,0.0,1037.0,,232.0,1269.0,152.0,4.0
829641,1.0,0,0.0,141.0,,55234.0,55375.0,55234.0,2.0


#### Inter-Row: Mis-matched Usage
##### Meter "Re-Write" (1 of 2):
It seems that sometimes the meter is "reset" to retroactively adjust values. A complication is that this somewhat overlaps with a separate issue of meter "roll over", when the reading exceeds the maximum number of digits.

A row in good order satisfies both rules below; a reset shows up as a breach of them:
1. 'invoice_date' - 'months_num' >= Prev('invoice_date')
2. 'mtr_val_old' >= 'mtr_val_new_prv' (alternatively, 'mtr_val_new' <= 'mtr_val_old_nxt')

Not all of the rows breaking the first rule seem out of order; however, removing them does not introduce an issue either. 'months_num' seems quite coarse, so rounding has to be allowed for.
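
One plausible way to apply the first rule with a rounding allowance is sketched below. It assumes the start of an invoice span is 'invoice_date' minus 'months_num' months of 30.5 days, and allows one such month of slack; the names 'span_start', 'tolerance' and 'overlap' are illustrative and the notebook's own 'date_flag' may be computed differently.

```python
import pandas as pd

# One meter, rows in date order.
df = pd.DataFrame({
    'invoice_date': pd.to_datetime(['2013-07-10', '2013-11-14', '2014-03-04', '2014-07-05']),
    'months_num': [4.0, 4.0, 4.0, 8.0],
})

# Implied start of each invoice span (assumed: 30.5 days per month).
span_start = df['invoice_date'] - pd.to_timedelta(df['months_num'] * 30.5, unit='D')
prev_invoice = df['invoice_date'].shift(1)
tolerance = pd.to_timedelta(30.5, unit='D')  # allow for coarse 'months_num'

# Rule 1 is breached when the span starts well before the previous invoice.
overlap = (span_start + tolerance) < prev_invoice
```

Here only the last row is flagged: its eight-month span reaches back past the previous invoice by far more than one month, whereas the small shortfalls on the other rows fall within the allowance.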

It will typically be the rows prior to the one that is "reset" that were "re-written"; these will be removed to restore order.

This is split into two steps, in an attempt to disentangle it from the meter "roll over" issue. The logic used is quite repetitive; this is so that different cases can be treated differently in the future if wanted. Otherwise, it could be consolidated.

The general logic is in the 'Calc_Overlap()' function. Where a value (date or meter reading) relative to the prior row seems out of place, it is first flagged. Any row prior to this, that is out of order of said row, is then flagged for removal. If a row would be out of order compared to multiple rows, precedence is given to the most onerous case (i.e., the earliest date or lowest meter reading).

In [26]:
def Calc_Overlap(df_train, mask, col_name):
    df_train['temp_flag'] = df_train.groupby(grp_cols, observed=True)['temp_flag'].transform('shift', periods=-1) # Flag row before the newly overlapping row
    # Sometimes, one flagged instance is subsumed by another, using cumulative min in reverse order to override these. 
    df_temp = df_train[grp_cols + ['temp_flag']].dropna(subset=['temp_flag']) # Broken off into temp df to speed up, as most values are NaT.
    df_temp['temp_flag'] = df_temp.groupby(grp_cols, observed=True)['temp_flag'].apply(lambda x: x[::-1].cummin()[::-1]).to_list() # Keep earliest value mentioned so far, per group, in reverse order
    df_train.loc[df_temp.index, 'temp_flag'] = df_temp['temp_flag'] # Bring it back in
    df_train['temp_flag'] = df_train.groupby(grp_cols, observed=True)['temp_flag'].bfill() # Flag all previous within the group
    df_train['temp_flag'] = (df_train[col_name] > df_train['temp_flag']) & (df_train['temp_flag'].notna()) # Flag only those within the overlapping period
    mask = df_train['temp_flag'] == True
    return df_train, mask
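
The reverse cumulative minimum used in 'Calc_Overlap()' can be seen on a toy group. This sketch mirrors the same steps but, as a simplification, uses plain day numbers instead of dates; the column names 'grp', 'invoice_day' and 'span_start' are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'grp': ['a'] * 6,
    'invoice_day': [0, 120, 240, 360, 480, 600],
    # Day on which a flagged row's span appears to start (NaN = not flagged).
    # Row 3 reaches back to day 200, row 5 reaches back to day 100.
    'span_start': [float('nan'), float('nan'), float('nan'), 200.0, float('nan'), 100.0],
})

# Move each threshold onto the row before the flagged row.
thr = df.groupby('grp')['span_start'].shift(-1)

# Reverse cumulative min: a later, more onerous threshold overrides earlier ones.
flagged = thr.dropna()
flagged = flagged.groupby(df.loc[flagged.index, 'grp']).transform(
    lambda s: s[::-1].cummin()[::-1])
thr.loc[flagged.index] = flagged

# Spread the threshold over all previous rows in the group.
thr = thr.groupby(df['grp']).bfill()

# Flag rows that fall inside the overlapping period.
df['overlap'] = (df['invoice_day'] > thr) & thr.notna()
```

Without the reverse cumulative minimum, the row at day 120 would be compared against the threshold of 200 and missed, even though the last row's span (starting at day 100) also covers it.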

Looking at those breaking both rules.

In [27]:
#df_train_save, df_removed_save = df_train.copy(), df_removed.copy()
df_train, df_removed = df_train_save.copy(), df_removed_save.copy()

In [28]:
mask = (df_train['date_flag'] == 1) # overlapping months
mask_2 = (df_train['mtr_flag_bkd'] == 1) | (df_train['mtr_flag_fwd'] == 1) # overlapping meters
print(f"Rows where meter readings also seem to be reset / Rows that seem out of order: {sum(mask & mask_2)} / {sum(mask)}.")
mask = mask & mask_2
df_train['temp_flag'] = pd.to_datetime(np.nan)
df_train.loc[mask, 'temp_flag'] = df_train.loc[mask, 'invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') # This is the date being sought after
df_train, mask = Calc_Overlap(df_train, mask, 'invoice_date') # Flag rows within overlapping period of subsequent rows
print(f"Rows affected by reset meter readings / total Rows: {sum(mask)} / {len(df_train)}.")
df_temp = df_train[mask]
col_names = ['invoice_date_prv', 'invoice_date_prv_calc', 'months_num', 'invoice_date', 'mtr_val_old_prv', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'mtr_val_new_nxt']
df_temp[col_names].head()

Rows where meter readings also seem to be reset / Rows that seem out of order: 318 / 3749.
Rows affected by reset meter readings / total Rows: 447 / 4372855.


Unnamed: 0,invoice_date_prv,invoice_date_prv_calc,months_num,invoice_date,mtr_val_old_prv,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,mtr_val_new_nxt
1812,2013-11-14,2014-01-02,2.0,2014-03-04,1311.0,1571.0,1571.0,1796.0,1571.0,1571.0
9411,2007-12-17,2007-12-06,2.0,2008-02-05,0.0,23475.0,23475.0,23475.0,22348.0,23475.0
23100,2010-12-14,2011-02-08,4.0,2011-06-10,3377.0,3745.0,3745.0,3745.0,0.0,4471.0
26942,2007-09-26,2007-12-09,2.0,2008-02-08,11395.0,11397.0,11771.0,11771.0,11397.0,11771.0
57251,2009-01-05,2009-08-29,2.0,2009-10-29,6666.0,6666.0,6666.0,6666.0,6609.0,6609.0


In [29]:
# Remove those flagged
df_train['mtr_reset'] = mask.astype(int)
df_train, df_removed, df_temp = Remove_Rows(df_train.copy(), mask, 'mtr_reset', df_removed.copy())
df_temp[col_names].head()
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 447 / 447.
Rows where meter readings seem out of order looking backwards / Total Rows: 11104 / 4372408.
Rows where meter readings seem out of order looking forwards / Total Rows: 11104 / 4372408.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 3406 / 4372408. Excluded NAs: 1425.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 3435 / 4372408. Excluded NAs: 1425.


Looking at those breaking rule 1 but not rule 2:
* If the meter is currently in perfect alignment:
    * If removing the necessary prior rows leads to perfect meter alignment ('mtr_val_old' == 'mtr_val_new_prv'), do so
* If the meter is currently in order but not in perfect alignment:
    * If removing the necessary prior rows leads to being in order ('mtr_val_old' > 'mtr_val_new_prv'), do so

The logic here is to not worsen the state of alignment by the changes.
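
The "do not worsen alignment" test can be sketched as a per-group before/after comparison. This is a simplified illustration on toy data: the 'candidate' column stands in for the rows flagged by 'Calc_Overlap()', and 'count_misaligned' is a hypothetical helper rather than a function from the notebook.

```python
import pandas as pd

df = pd.DataFrame({
    'grp':         ['a', 'a', 'a', 'b', 'b', 'b'],
    'mtr_val_old': [0, 100, 100, 0, 50, 80],
    'mtr_val_new': [100, 130, 160, 50, 80, 120],
    'candidate':   [False, True, False, False, True, False],  # rows proposed for removal
})

def count_misaligned(d):
    """Per group, count rows whose old reading differs from the previous new reading."""
    prv = d.groupby('grp')['mtr_val_new'].shift(1)
    mis = (d['mtr_val_old'] != prv) & prv.notna()
    return mis.groupby(d['grp']).sum()

before = count_misaligned(df)
after = count_misaligned(df[~df['candidate']]).reindex(before.index, fill_value=0)

# Only act on groups where removal strictly reduces misalignment.
improved = after < before
remove = df['candidate'] & df['grp'].map(improved)
```

In group 'a' the candidate row is a re-written reading, so dropping it restores perfect alignment and it is removed. In group 'b' the readings already chain perfectly, so dropping the candidate would break the chain and it is kept.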

In [None]:
# Looking at those currently in perfect meter alignment
mask = (df_train['date_flag'] == 1) & (((df_train['mtr_val_new'] == df_train['mtr_val_old_nxt']) | (df_train['mtr_val_old_nxt'].notna())) |
                                       ((df_train['mtr_val_old'] == df_train['mtr_val_new_prv']) | (df_train['mtr_val_new_prv'].notna())))
df_train['temp_flag'] = pd.to_datetime(np.nan)
df_train.loc[mask, 'temp_flag'] = df_train.loc[mask, 'invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') # This is the date being sought after
df_train, mask = Calc_Overlap(df_train, mask, 'invoice_date') # Flag rows within overlapping period of subsequent rows
# Want to check whether removing these improves alignment.
df_temp = df_train[df_train['temp_flag'] == False].copy() # Comparison case
print("vvv Ignore outputs below! vvv")
df_temp = Calc_Neighbours(df_temp.copy(), grp_cols) # Update values
print("^^^ Ignore outputs above! ^^^")
df_temp['temp_flag_2'] = df_temp['mtr_val_old'] != df_temp['mtr_val_new_prv'] # Not perfect alignment
df_train['temp_flag_2'] = df_train['mtr_val_old'] != df_train['mtr_val_new_prv'] 
# Only those that reduces number of rows that have non-perfect alignment kept
mask = (df_temp.groupby(grp_cols, observed=True)['temp_flag_2'].agg('sum') < df_train.groupby(grp_cols, observed=True)['temp_flag_2'].agg('sum')).reset_index()
mask = mask[mask['temp_flag_2'] == True]
df_train['temp_flag_2'] = df_train.set_index(grp_cols).index.isin(mask.set_index(grp_cols).index) # Flagging the groups where it was improvement
mask = (df_train['temp_flag'] == True) & (df_train['temp_flag_2'] == True)
print(f"Rows with overlapping periods that would otherwise align if removed / Total overlapping rows: {sum(mask)} / {sum(df_train['temp_flag'] == 1)}.")

vvv Ignore outputs below! vvv
Rows where meter readings seem out of order looking backwards / Total Rows: 11070 / 4368964.
Rows where meter readings seem out of order looking forwards / Total Rows: 11070 / 4368964.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 0 / 4368964. Excluded NAs: 1425.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 30 / 4368964. Excluded NAs: 1425.
^^^ Ignore outputs above! ^^^
Number of rows with overlapping periods that would otherwise align if removed / out total overlapping rows: 162 / 3444.


In [31]:
# Remove those flagged
df_train.loc[mask, 'mtr_reset'] = 1
df_train, df_removed, df_temp = Remove_Rows(df_train.copy(), mask, 'mtr_reset', df_removed.copy())
df_temp[col_names].head()
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 162 / 162.
Rows where meter readings seem out of order looking backwards / Total Rows: 11070 / 4372246.
Rows where meter readings seem out of order looking forwards / Total Rows: 11070 / 4372246.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 3259 / 4372246. Excluded NAs: 1425.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 3288 / 4372246. Excluded NAs: 1425.


In [32]:
# Looking at those currently in order but not in perfect meter alignment
mask =  (df_train['date_flag'] == 1) & (((df_train['mtr_val_new'] < df_train['mtr_val_old_nxt']) | (df_train['mtr_val_old_nxt'].notna())) |
                                       ((df_train['mtr_val_old'] > df_train['mtr_val_new_prv']) | (df_train['mtr_val_new_prv'].notna())))
df_train['temp_flag'] = pd.to_datetime(np.nan)
df_train.loc[mask, 'temp_flag'] = df_train.loc[mask, 'invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') # This is the date being sought after
df_train, mask = Calc_Overlap(df_train, mask, 'invoice_date') # Flag rows within overlapping period of subsequent rows
# Want to check whether removing these causes good alignment.
df_temp = df_train[df_train['temp_flag'] == False].copy() # Comparison case
print("vvv Ignore outputs below! vvv")
df_temp = Calc_Neighbours(df_temp.copy(), grp_cols) # Update values
print("^^^ Ignore outputs above! ^^^")
df_temp['temp_flag_2'] = df_temp['mtr_val_old'] < df_temp['mtr_val_new_prv']  # Not in order (previous test was for perfect alignment)
df_train['temp_flag_2'] = df_train['mtr_val_old'] > df_train['mtr_val_new_prv'] 
# Only those that reduces number of misaligned rows kept
mask = (df_temp.groupby(grp_cols, observed=True)['temp_flag_2'].agg('sum') < df_train.groupby(grp_cols, observed=True)['temp_flag_2'].agg('sum')).reset_index()
mask = mask[mask['temp_flag_2'] == True]
df_train['temp_flag_2'] = df_train.set_index(grp_cols).index.isin(mask.set_index(grp_cols).index) # Flagging the groups where it was improvement
mask = (df_train['temp_flag'] == True) & (df_train['temp_flag_2'] == True)
print(f"Rows with overlapping periods that stay aligned if removed / Total overlapping rows: {sum(mask)} / {sum(df_train['temp_flag'] == 1)}.")

vvv Ignore outputs below! vvv
Rows where meter readings seem out of order looking backwards / Total Rows: 11070 / 4368964.
Rows where meter readings seem out of order looking forwards / Total Rows: 11070 / 4368964.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 0 / 4368964. Excluded NAs: 1425.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 30 / 4368964. Excluded NAs: 1425.
^^^ Ignore outputs above! ^^^
Number of rows with overlapping periods that stay aligned if removed / out total overlapping rows: 2159 / 3282.


In [33]:
# Remove those flagged
df_train.loc[mask, 'mtr_reset'] = 1
df_train, df_removed, df_temp = Remove_Rows(df_train.copy(), mask, 'mtr_reset', df_removed.copy())
df_temp[col_names].head()
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 2159 / 2159.
Rows where meter readings seem out of order looking backwards / Total Rows: 11070 / 4370087.
Rows where meter readings seem out of order looking forwards / Total Rows: 11070 / 4370087.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 1119 / 4370087. Excluded NAs: 1425.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 1149 / 4370087. Excluded NAs: 1425.


Note that this leaves behind two sets of cases:
1. Those with overlapping invoice spans but with meters in order, where removing rows degrades alignment. 
   * A subset of the rows breaking rule 1 but not rule 2.
2. Those that do not have overlapping 'months_num' but with meters out of order. 
   * Rows breaking rule 2 but not rule 1.

Will return to these later.
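To make the two cases concrete, here is a minimal sketch on made-up rows for a single meter. It assumes that "rule 1" means invoice spans must not overlap (with a one-month tolerance, mirroring 'date_flag') and that "rule 2" means meter readings must be in order (mirroring 'mtr_flag_bkd'); the data and the names `breaks_rule_1` and `breaks_rule_2` are hypothetical.

```python
import pandas as pd

# Made-up invoices for one meter, already sorted by invoice_date.
df = pd.DataFrame({
    "invoice_date": pd.to_datetime(["2012-01-15", "2012-05-15", "2012-07-15", "2012-11-15"]),
    "months_num": [4, 4, 4, 4],
    "mtr_val_old": [100.0, 500.0, 900.0, 300.0],
    "mtr_val_new": [500.0, 900.0, 1200.0, 700.0],
})

prev_date = df["invoice_date"].shift(1)
prev_new = df["mtr_val_new"].shift(1)

# Start of the invoice span implied by months_num (30.5 days per month).
span_start = df["invoice_date"] - pd.to_timedelta(df["months_num"] * 30.5, unit="days")

# Rule 1 broken: the span starts more than a month before the previous invoice date.
breaks_rule_1 = (span_start + pd.Timedelta(days=30.5)) < prev_date
# Rule 2 broken: the opening reading is below the previous closing reading.
breaks_rule_2 = df["mtr_val_old"] < prev_new

print(pd.DataFrame({"rule_1": breaks_rule_1, "rule_2": breaks_rule_2}))
```

In this toy data the third row is an instance of case 1 (spans overlap, meters in order) and the fourth row is an instance of case 2 (no overlap, meters out of order).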

#### "Reset" Meter: Rollover
There is also a "rollover" issue with the 'mtr_val_{old|new}' columns, where some meters reset to 0 after exceeding their cap (for example 100000 or 1000000). Two scenarios:
* Rollover occurs within a row (intra-row)
* Rollover occurs between rows (inter-row); it is assumed this requires a "gap" in dates

One complication is that the cap for a given meter is not known, so it has to be inferred from the readings that caused the rollover. Where possible, the inferred cap is added back to the affected readings to fix the issue.
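As a minimal sketch of the inference, assuming the cap is the smallest power of ten above the largest reading seen for a meter (the readings below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical readings for one meter; the largest value has 4 digits.
readings = pd.Series([8992.0, 9537.0, 219.0, 711.0])

# Smallest power of ten strictly greater than the largest reading seen,
# taken here as the assumed rollover cap for the meter.
cap = 10 ** np.ceil(np.log10(readings.max() + 1))

# A reading taken after the rollover is restored by adding the cap back.
restored = 219.0 + cap
print(cap, restored)
```

The `+ 1` keeps a reading of exactly 9999 at a cap of 10000 while pushing a reading of 10000 up to a cap of 100000.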

##### Inter-Row:
Unfortunately, it is at times quite hard to distinguish a meter rolling over from a meter re-write. The rules settled upon are:
* Sufficient Time:
   * Must have unaccounted gap in invoice spans in which meter could have rolled over.
* Sufficient Usage:
   * Previous month's monthly usage projected forwards into the gap must be enough to increase number of digits.
   * OR, Current month's monthly usage projected backwards into the gap must be enough to cause negative values.
* Qualifiers:
   * The digit count implied by the alleged rollover must be greater than any already seen in the group:
       * For example, a rollover at 10000 cannot be claimed if a meter reading >10000 already exists.
       * It must also be greater than the assumed minimum: 9999.
   * The relied upon monthly usage cannot be too abnormal:
       * Within (Median/3) < usage < (Median*3), ignoring those with 0 usage.
   * Meter must not be locked at 0 for both mtr_val_old and mtr_val_new.
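The rules above can be sketched for a single pair of neighbouring rows as follows. This is a simplified scalar illustration, not the vectorised implementation: the function name, its arguments and the example values are hypothetical, the usage passed in stands for whichever month's usage is being projected into the gap, and the tolerance of 2000 is borrowed from the code cell that follows.

```python
def looks_like_inter_row_rollover(prev_new, cur_old, cur_new, monthly_usage,
                                  gap_months, group_median, cap, tolerance=2000):
    """Hypothetical scalar check of the inter-row rules for one pair of rows."""
    # Sufficient time: an unaccounted gap must exist between the invoice spans.
    if gap_months <= 1:
        return False
    # Qualifier: the usage relied upon must lie within a factor of 3 of the median.
    if not (group_median / 3 < monthly_usage < group_median * 3):
        return False
    # Qualifier: a meter locked at 0 cannot be fixed by adding the cap.
    if cur_old == 0 and cur_new == 0:
        return False
    # Reading expected at the start of the current row had no rollover occurred.
    expected = prev_new + round(monthly_usage * gap_months)
    # Sufficient usage: the projection must pass both the cap and the 9999 minimum.
    if expected <= cap or expected <= 9999:
        return False
    # The projection should land near the current reading once the cap is added back.
    return abs(expected - (cur_old + cap)) < tolerance


# Made-up pair of rows: 9977 before the gap, 219 after it, assumed cap of 10000.
print(looks_like_inter_row_rollover(9977, 219, 711, 123, 2, 120, 10000))  # gap present
print(looks_like_inter_row_rollover(9977, 219, 711, 123, 0, 120, 10000))  # no gap
```

Note that this sketch simply rejects meters locked at 0, whereas the notebook flags them separately as suspected but unfixable.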

In [None]:
# Store the max digits seen per group; assumed constant, and that a rollover cannot happen below this.
df_train['mtr_max_digit'] = df_train[['mtr_val_old', 'mtr_val_new']].max(axis=1) # value per row
df_train['mtr_max_digit'] = df_train.groupby(grp_cols, observed=True)[['mtr_max_digit']].transform('max') # value per group
df_train['mtr_max_digit'] = 10 ** np.ceil(np.log10(df_train['mtr_max_digit'] + 1)) # Smallest power of ten above the max reading, i.e. the assumed cap
# Looking at rollover across rows
df_train['monthly_usage_grp_median'] = df_train['monthly_usage'].replace(0, np.nan)
df_train['monthly_usage_grp_median'] = df_train.groupby(grp_cols, observed=True)['monthly_usage_grp_median'].transform('median')
mask = ((df_train['months_gap_calc'] > 1) & # Must have a gap
        (df_train['usage_flag'] == 0) & # Usage must make sense
        (df_train['monthly_usage'] < (df_train['monthly_usage_grp_median'] * 3)) & # Usage not more than triple median 
        ((df_train['monthly_usage'] * 3) > df_train['monthly_usage_grp_median'])) # Usage not less than third median
# Looking if current usage would have caused rollover going backwards:
df_train['temp_flag'] = np.nan
df_train.loc[mask, 'temp_flag'] = (df_train.loc[mask, 'monthly_usage'] * df_train.loc[mask, 'months_gap_calc']).round(0) # Projected usage
mask_2 = ((df_train['mtr_val_old'] < df_train['temp_flag']) & # Had enough (gap * usage) to have caused rollover
          (df_train['mtr_val_new_prv'] + df_train['temp_flag'] > df_train['mtr_max_digit']) & # Would have caused highest digit count seen
          (df_train['mtr_val_new_prv'] + df_train['temp_flag'] > 9999) & # Minimum rollover point assumed
          (((df_train['mtr_val_new_prv'] + df_train['temp_flag']) - (df_train['mtr_val_old'] + df_train['mtr_max_digit'])).abs() < 2000))
# Looking if previous usage would have caused rollover going forwards:
df_train['temp_flag'] = np.nan
df_train.loc[mask, 'temp_flag'] = (df_train.loc[mask, 'monthly_usage_prv'] * df_train.loc[mask, 'months_gap_calc']).round(0) # Projected usage
mask = ((df_train['mtr_val_new_prv'] + df_train['temp_flag'] > df_train['mtr_max_digit']) & # Would have caused highest digit count seen (and so also rollover)
        (df_train['mtr_val_new_prv'] + df_train['temp_flag'] > 9999) & # Minimum rollover point assumed
        (((df_train['mtr_val_new_prv'] + df_train['temp_flag']) - (df_train['mtr_val_old'] + df_train['mtr_max_digit'])).abs() < 2000))
print(f"Rows suspected of inter-row rollover projecting backwards / those projecting forwards: {sum(mask_2)} / {sum(mask)}.")
# Ignore those that have mtr_val_old and mtr_val_new == 0
mask = (mask | mask_2) 
mask_2 = (df_train['mtr_val_old'] == 0) & (df_train['mtr_val_new'] == 0) 
print(f"Rows suspected of inter-row rollover, deemed fixable / Rows suspected: {sum(mask & ~mask_2)} / {sum(mask)}.")
df_train['mtr_old_rollover'] = (mask & ~mask_2).astype(int)
df_train.loc[(mask & mask_2), 'mtr_old_rollover'] = -1
col_names = ['client_id', 'invoice_date', 'usage_n', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'months_num', 'mtr_val_new_calc', 'mtr_max_digit']
df_temp = df_train[(mask & mask_2) if sum(mask & mask_2)>0 else (mask & ~mask_2)] # Showing those unfixable if they exist, otherwise, just those flagged
df_temp[col_names]

Rows suspected of inter-row rollover projecting backwards / those projecting forwards: 92 / 152.
Rows suspected of inter-row rollover, deemed fixable / Rows suspected: 185 / 185.


Unnamed: 0,client_id,invoice_date,usage_n,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,months_num,mtr_val_new_calc,mtr_max_digit
1643,train_Client_100044,2012-12-11,492.0,9977.0,219.0,711.0,711.0,4.0,711.0,10000.0
9094,train_Client_100257,2010-01-26,2171.0,8992.0,2860.0,5031.0,7420.0,4.0,5031.0,10000.0
25836,train_Client_100717,2011-10-13,556.0,9537.0,1429.0,1985.0,1985.0,4.0,1985.0,10000.0
132114,train_Client_103569,2012-12-17,2971.0,85269.0,1305.0,4276.0,,4.0,4276.0,100000.0
134581,train_Client_103636,2012-04-13,127.0,8863.0,3802.0,3929.0,4116.0,2.0,3929.0,10000.0
...,...,...,...,...,...,...,...,...,...,...
4277887,train_Client_94567,2016-04-22,12682.0,997907.0,7745.0,20427.0,20427.0,6.0,20427.0,1000000.0
4306331,train_Client_95331,2015-05-04,220.0,6501.0,2925.0,3145.0,3145.0,4.0,3145.0,10000.0
4400731,train_Client_97944,2009-08-04,4371.0,97197.0,1478.0,5849.0,5849.0,1.0,5849.0,100000.0
4407063,train_Client_98106,2014-05-26,704.0,9794.0,2066.0,2770.0,2770.0,2.0,2770.0,10000.0


##### Intra-Row:
Rules for these:
* Expected values must exceed maximum number of digits seen in a group, and exceed 9999
* Monthly usage cannot be too abnormal; within: (Median/3) < usage < (Median*3), ignoring those with 0 usage.
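A minimal sketch of these two rules on made-up rows; the column `usage_median` is a stand-in for the per-group median of non-zero monthly usage, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical rows: expected closing reading, the cap inferred for the group,
# the row's monthly usage and the group's median monthly usage.
df = pd.DataFrame({
    "mtr_val_new_calc": [10450.0, 5200.0, 10450.0],
    "mtr_max_digit":    [10000.0, 10000.0, 10000.0],
    "monthly_usage":    [150.0, 150.0, 900.0],
    "usage_median":     [140.0, 140.0, 140.0],
})

suspected = (
    (df["mtr_val_new_calc"] > df["mtr_max_digit"])      # would exceed the digits seen
    & (df["mtr_val_new_calc"] > 9999)                   # assumed minimum rollover point
    & (df["monthly_usage"] < df["usage_median"] * 3)    # not more than triple the median
    & (df["monthly_usage"] * 3 > df["usage_median"])    # not less than a third of it
)
print(suspected.tolist())
```

Only the first row qualifies: the second never reaches the cap, and the third relies on usage far above the median.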

In [42]:
mask = ((df_train['mtr_val_new_calc'] > df_train['mtr_max_digit']) & # Looking where number of digits would have exceeded most seen
        (df_train['mtr_val_new_calc'] > 9999) & # Minimum rollover point assumed
        (df_train['monthly_usage'] < (df_train['monthly_usage_grp_median'] * 3)) & # Usage not more than triple median 
        ((df_train['monthly_usage'] * 3) > df_train['monthly_usage_grp_median'])) # Usage not less than third median
# Of these, only apply the fix where it does not misalign the next row, and where the meter values are not both 0
mask_2 = ((((df_train['mtr_val_new_calc'] - df_train['mtr_max_digit']) <= df_train['mtr_val_old_nxt']) |
          (df_train['mtr_val_old_nxt'].isna())) &
          ((df_train['mtr_val_old'] != 0) | (df_train['mtr_val_new'] != 0)))

print(f"Rows suspected of intra-row rollover, those deemed fixable / Rows suspected: {sum(mask & mask_2)} / {sum(mask)}.")
df_train['mtr_new_rollover'] = mask.astype(int)
df_train.loc[(mask & (~mask_2)), 'mtr_new_rollover'] = -1 
df_temp = df_train[(mask & (~mask_2)) if sum(mask & (~mask_2))>0 else (mask & mask_2)] # Showing those unfixable if they exist, otherwise, just those flagged
df_temp[col_names]

Rows suspected of intra-row rollover, those deemed fixable / Rows suspected: 1974 / 1978.


Unnamed: 0,client_id,invoice_date,usage_n,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,months_num,mtr_val_new_calc,mtr_max_digit
2045173,train_Client_33768,2011-11-16,14111.0,,0.0,0.0,,4.0,14111.0,1.0
2738687,train_Client_52765,2009-05-15,18287.0,1674.0,1674.0,1674.0,1674.0,32.0,19961.0,10000.0
3445301,train_Client_72015,2007-11-08,18117.0,,0.0,0.0,,4.0,18117.0,1.0
3721211,train_Client_79521,2007-01-23,20043.0,2226.0,2226.0,2226.0,2226.0,16.0,22269.0,10000.0


Once a rollover occurs, it has knock-on effects on all later rows, and multiple rollovers can occur for the same meter. 
* Inter-Row: 'mtr_old_rollover':
  * 'mtr_val_old' and 'mtr_val_new' need adjusting at every instance.
* Intra-Row: 'mtr_new_rollover':
  * At the row where it occurs, only 'mtr_val_new' needs adjusting; all later rows need both 'mtr_val_new' and 'mtr_val_old' adjusting.

The expected 'mtr_val_new_calc' is then recalculated once more.
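The carry-forward can be sketched with a grouped cumulative sum. This is a simplified illustration on made-up data, with a hypothetical helper `rollover_adjustments`; it takes the cumulative sum over a zero-filled column rather than forward-filling flagged rows, which should give the same adjustments.

```python
import pandas as pd

def rollover_adjustments(flag, cap, groups, intra_row=False):
    """Hypothetical helper: cumulative amounts to add to (mtr_val_old, mtr_val_new)."""
    step = flag * cap                           # cap on flagged rows, 0 elsewhere
    new_adj = step.groupby(groups).cumsum()     # every rollover carries forward in its group
    # Intra-row: on the flagged row itself the opening reading predates the rollover.
    old_adj = new_adj - step if intra_row else new_adj
    return old_adj, new_adj

df = pd.DataFrame({
    "mtr_id":      [1, 1, 1, 1],
    "mtr_val_old": [9000.0, 100.0, 600.0, 200.0],
    "mtr_val_new": [9500.0, 600.0, 9800.0, 700.0],
    "rollover":    [0, 1, 0, 1],                # two inter-row rollovers (made up)
    "cap":         [10000.0] * 4,
})
old_adj, new_adj = rollover_adjustments(df["rollover"], df["cap"], df["mtr_id"])
df["mtr_val_old"] += old_adj
df["mtr_val_new"] += new_adj
print(df[["mtr_val_old", "mtr_val_new"]])
```

After the adjustment every opening reading is at or above the previous closing reading, which is the alignment the later checks look for.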

In [43]:
# Keep track of the changes made to mtr_val_old and mtr_val_new. Combine both mtr_old_rollover and mtr_new_rollover before making the change.
df_train[['mtr_old_adj_quant', 'mtr_new_adj_quant']] = 0
col_names = ['temp_flag', 'temp_flag_2']
for mask_name in ['mtr_old_rollover', 'mtr_new_rollover']:
    mask = df_train[mask_name] == 1
    df_train[col_names] = np.nan
    df_train.loc[mask, col_names] = df_train.loc[mask,'mtr_max_digit']
    df_train.loc[mask, col_names] = df_train[mask].groupby(grp_cols, observed=True)[col_names].cumsum() # Count number of instances
    df_train[col_names] = df_train.groupby(grp_cols, observed=True)[col_names].ffill()
    df_train[col_names] = df_train[col_names].fillna(0)
    df_train['mtr_old_adj_quant'] = df_train['mtr_old_adj_quant'] + df_train['temp_flag']
    df_train['mtr_new_adj_quant'] = df_train['mtr_new_adj_quant'] + df_train['temp_flag_2']
    print(f"Rows updated due to {mask_name}: {sum(df_train['temp_flag_2'] > 0)}.")
mask = df_train['mtr_new_rollover'] == 1
df_train.loc[mask,'mtr_old_adj_quant'] = df_train.loc[mask,'mtr_old_adj_quant'] - df_train.loc[mask,'mtr_max_digit'] # Ignore first instance for new_rollover
# Update the values
df_train['mtr_val_old'] = df_train['mtr_val_old'] + df_train['mtr_old_adj_quant']
df_train['mtr_val_new'] = df_train['mtr_val_new'] + df_train['mtr_new_adj_quant']

Rows updated due to mtr_old_rollover: 1796.
Rows updated due to mtr_new_rollover: 26218.


In [44]:
# Re-sort and check for the same rule again
df_train = Calc_Usage(df_train.copy())
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows with unexpected usage values / Total Rows: 8350 / 4370087.
Rows where meter readings seem out of order looking backwards / Total Rows: 10903 / 4370087.
Rows where meter readings seem out of order looking forwards / Total Rows: 10903 / 4370087.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 1119 / 4370087. Excluded NAs: 1425.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 1149 / 4370087. Excluded NAs: 1425.


#### Inter-Row: Mis-matched Usage (Continued)
##### Meter "Re-Write" (2 of 2):
Returning to the two sets of cases left previously:
1. Those with overlapping invoice spans but with meters in order, where removing rows degrades alignment. 
   * A subset of the rows breaking rule 1 but not rule 2.
2. Those that do not have overlapping 'months_num' but with meters out of order. 
   * Rows breaking rule 2 but not rule 1.

This section is a lot more coercive due to time constraints: the rows need to be put in order, even if some of the data is sacrificed to do so. Based on this exploration, it seems clear that no column will be ideal to use. 'mtr_val_new' was deemed the best overall metric to rely upon, as it should align directly with 'invoice_date', which is assumed to always be correct.

Looking at case 2: those without overlapping 'months_num' but with meters out of order. One option would be to treat those with a "hard" reset separately, i.e. meters that have a consistent history and then start again from 0 with another consistent history. For example, these could be split into separate pseudo-'mtr_id' entities. However, given how rare this is and the time constraints, it was not done here for simplicity. Similarly, some of these rows could have had only the violating data imputed rather than being deleted outright.
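For reference, the pseudo-'mtr_id' split that was not pursued could look roughly like the sketch below. It is only an illustration on made-up data: the columns `segment` and `pseudo_mtr_id` are hypothetical, and a "hard" reset is assumed to be any row whose opening reading is below the previous closing reading.

```python
import pandas as pd

# Made-up history for one meter that restarts from 0 part-way through.
df = pd.DataFrame({
    "mtr_id": [7, 7, 7, 7, 7],
    "mtr_val_old": [100.0, 400.0, 0.0, 250.0, 600.0],
    "mtr_val_new": [400.0, 900.0, 250.0, 600.0, 950.0],
})

# Previous closing reading within the same meter (NaN for the first row).
prev_new = df.groupby("mtr_id")["mtr_val_new"].shift(1)

# A reset is assumed wherever the opening reading drops below the previous close.
is_reset = df["mtr_val_old"] < prev_new

# Each reset starts a new segment; the running count labels the segments.
df["segment"] = is_reset.astype(int).groupby(df["mtr_id"]).cumsum()
df["pseudo_mtr_id"] = df["mtr_id"].astype(str) + "_" + df["segment"].astype(str)
print(df[["mtr_id", "segment", "pseudo_mtr_id"]])
```

Grouping by the pseudo identifier instead of 'mtr_id' would then leave each segment with meter readings in order.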

In [48]:
mask = ((df_train['mtr_flag_bkd'] != 0)) & (df_train['date_flag'] == 0)
df_train['temp_flag'] = np.nan
df_train.loc[mask, 'temp_flag'] = df_train.loc[mask, 'mtr_val_old'] # This is value to stay under
df_train, mask = Calc_Overlap(df_train, mask, 'mtr_val_new') # Flag rows within overlapping period of subsequent rows
print(f"Rows with out of order meter values / Total affected overlapping rows: {sum(((df_train['mtr_flag_bkd'] != 0)) & (df_train['date_flag'] == 0))} / {sum(mask)}.")

Rows with out of order meter values / Total affected overlapping rows: 10898 / 17817.


In [49]:
# Remove those flagged
df_train.loc[mask, 'mtr_reset'] = 1
df_train, df_removed, df_temp = Remove_Rows(df_train.copy(), mask, 'mtr_reset', df_removed.copy())
df_temp[col_names].head()
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 17817 / 17817.
Rows where meter readings seem out of order looking backwards / Total Rows: 3 / 4352270.
Rows where meter readings seem out of order looking forwards / Total Rows: 3 / 4352270.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 1114 / 4352270. Excluded NAs: 1419.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 1142 / 4352270. Excluded NAs: 1419.


In [51]:
mask = ((df_train['mtr_val_old'] < df_train['mtr_val_new_prv']) & (df_train['mtr_val_new_prv'].notna()))
print(f"Rows where meter readings seem out of order looking backwards: {sum(mask)}. Excluded NAs: {sum(df_train['mtr_val_new_prv'].isna())}.")
df_temp = df_train[mask]

Rows where meter readings seem out of order looking backwards: 3. Excluded NAs: 228999.


In [52]:
grp_cols

['client_id', 'mtr_type', 'mtr_tariff', 'mtr_id']

In [None]:
def Calc_Neighbours(df_train, grp_cols=grp_cols):
    rows = len(df_train)
    # Sort, Group, Find neighbours
    df_train.sort_values(by=grp_cols+['invoice_date', 'mtr_code', 'usage_n'], inplace=True)
    df_grp = df_train.groupby(grp_cols, observed=True)
    df_train['mtr_val_new_prv_2'] = df_grp['mtr_val_new'].shift(2)
    col_names = ['mtr_val_old', 'mtr_val_new', 'invoice_date', 'months_num', 'monthly_usage']
    df_train[[col_name+'_prv' for col_name in col_names]] = (df_grp[col_names].shift(1))
    df_train[[col_name+'_nxt' for col_name in col_names[:2]]] = (df_grp[col_names[:2]].shift(-1))
    df_train['mtr_val_old_nxt_2'] = df_grp['mtr_val_old'].shift(-2)
    # Calculate expected metrics based on relationships
    df_train['invoice_date_prv_calc'] = pd.NaT # If months_num is unreasonable it would crash
    mask = (df_train['months_num'] > 0) & (df_train['months_num'] < 600) & (df_train['months_num'].notna()) # 0 - 50 Years
    df_train['invoice_date_prv_calc'] = df_train.loc[mask, 'invoice_date'] - pd.to_timedelta(df_train.loc[mask, 'months_num'].astype(int) * 30.5, unit='days')
    df_train['months_num_calc'] = np.ceil(((df_train['invoice_date'] - df_train['invoice_date_prv']).dt.days.values / 30.5))
    df_train['months_gap_calc'] = df_train['months_num_calc'] - df_train['months_num'] # Difference in dates in months
    # Check Rules
    mask = ((df_train['mtr_val_old'] < df_train['mtr_val_new_prv']) & (df_train['mtr_val_new_prv'].notna()))
    df_train['mtr_flag_bkd'] = mask.astype(int)
    print(f"Rows where meter readings seem out of order looking backwards / Total Rows: {sum(mask)} / {rows}. Excluded NAs: {sum(df_train['mtr_val_new_prv'].isna())}.")
    mask = ((df_train['mtr_val_new'] > df_train['mtr_val_old_nxt']) & (df_train['mtr_val_old_nxt'].notna()))
    df_train['mtr_flag_fwd'] = mask.astype(int)
    print(f"Rows where meter readings seem out of order looking forwards / Total Rows: {sum(mask)} / {rows}. Excluded NAs: {sum(df_train['mtr_val_old_nxt'].isna())}.")
    mask = ((df_train['invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') < df_train['invoice_date_prv']) & (df_train['invoice_date_prv_calc'].notna()))
    df_train['date_flag'] = mask.astype(int)
    print(f"Rows where invoice spans seem to overlap based on \'months_num\' / Total Rows: {sum(mask)} / {rows}. Excluded NAs: {sum(df_train['months_num'].isna())}.")
    mask = ((df_train['months_num_calc'] < df_train['months_num']) & (df_train['invoice_date_prv_calc'].notna()))
    df_train.loc[mask, 'date_flag'] = 1
    print(f"Rows where invoice spans seem to overlap based on \'invoice_date\'s / Total Rows: {sum(mask)} / {rows}. Excluded NAs: {sum(df_train['months_num'].isna())}.")
    return df_train.copy()

In [69]:
df_temp = df_train.sort_values(by=grp_cols + ['invoice_date', 'mtr_code'], kind='stable').copy()
df_temp = df_temp.sort_values(by=grp_cols + ['invoice_date'],  kind='stable')

MemoryError: Unable to allocate 863. MiB for an array with shape (26, 4352270) and data type float64

In [64]:
df_temp = df_temp.sort_values(by=grp_cols + ['invoice_date'])