# Energy Fraud Detection

Imports for notebook:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
import seaborn as sns
%matplotlib inline

#from itables import init_notebook_mode
#init_notebook_mode(all_interactive=True)

## Introduction

This notebook looks at a dataset of clients and their energy usage over time. The target variable is fraud, which is labelled on a per-client basis.

This is an interesting problem because the data is partially corrupted and has major inconsistencies. Furthermore, there is insufficient information to obtain the target variable for all cases.

This notebook conducts the initial data cleaning, where the raw datasets are input, and the 'cleaned' datasets are output.

When rows are removed, they are placed in the df_removed_{train|test} dataset. At least one row for every 'client_id' is needed.

Where relevant, the last row will be kept for the 'client_id' if all would otherwise be removed.

Please refer to the main notebook, "???", for further details and continuation.

## Data Cleaning

### Data Import

In [2]:
# Read the CSV files
df_client_test = pd.read_csv('./client_test.csv', on_bad_lines='skip')
df_client_train = pd.read_csv('./client_train.csv', on_bad_lines='skip')
df_invoice_test = pd.read_csv('./invoice_test.csv', on_bad_lines='skip')
# low_memory is prompted due to unexpected values and large datasize
df_invoice_train = pd.read_csv('./invoice_train.csv', on_bad_lines='skip', low_memory=False)
df_SampleSubmission = pd.read_csv('./SampleSubmission (2).csv', on_bad_lines='skip')

In [3]:
df_SampleSubmission.head()

Unnamed: 0,client_id,target
0,test_Client_0,0.957281
1,test_Client_1,0.996425
2,test_Client_10,0.612359
3,test_Client_100,0.776933
4,test_Client_1000,0.571046


In [4]:
df_client_test.head()

Unnamed: 0,disrict,client_id,client_catg,region,creation_date
0,62,test_Client_0,11,307,28/05/2002
1,69,test_Client_1,11,103,06/08/2009
2,62,test_Client_10,11,310,07/04/2004
3,60,test_Client_100,11,101,08/10/1992
4,62,test_Client_1000,11,301,21/07/1977


In [5]:
df_invoice_test.head()

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
0,test_Client_0,2018-03-16,11,651208,0,203,8,1,755,0,0,0,19145,19900,8,ELEC
1,test_Client_0,2014-03-21,11,651208,0,203,8,1,1067,0,0,0,13725,14792,8,ELEC
2,test_Client_0,2014-07-17,11,651208,0,203,8,1,0,0,0,0,14792,14792,4,ELEC
3,test_Client_0,2015-07-13,11,651208,0,203,9,1,410,0,0,0,16122,16532,4,ELEC
4,test_Client_0,2016-07-19,11,651208,0,203,9,1,412,0,0,0,17471,17883,4,ELEC


In [6]:
print(f"Number of rows in client train vs invoice train: {len(df_client_train)} vs {len(df_invoice_train)}")
print(f"Number of unique client_id in client train vs invoice train: {df_client_train['client_id'].nunique()} vs {df_invoice_train['client_id'].nunique()}")
print(f"Number of rows in client test vs invoice test: {len(df_client_test)} vs {len(df_invoice_test)}")
print(f"Number of unique client_id in client test vs invoice test: {df_client_test['client_id'].nunique()} vs {df_invoice_test['client_id'].nunique()}", end="")

Number of rows in client train vs invoice train: 135493 vs 4476749
Number of unique client_id in client train vs invoice train: 135493 vs 135493
Number of rows in client test vs invoice test: 58069 vs 1939730
Number of unique client_id in client test vs invoice test: 58069 vs 58069

Merging the client and invoice tables on 'client_id', for both train and test:

In [7]:
df_test = df_invoice_test.join(df_client_test.set_index('client_id'), on='client_id', validate='m:1').copy()
df_train = df_invoice_train.join(df_client_train.set_index('client_id'), on='client_id', validate='m:1').copy()
df_train

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,...,consommation_level_4,old_index,new_index,months_number,counter_type,disrict,client_catg,region,creation_date,target
0,train_Client_0,2014-03-24,11,1335667,0,203,8,1,82,0,...,0,14302,14384,4,ELEC,60,11,101,31/12/1994,0.0
1,train_Client_0,2013-03-29,11,1335667,0,203,6,1,1200,184,...,0,12294,13678,4,ELEC,60,11,101,31/12/1994,0.0
2,train_Client_0,2015-03-23,11,1335667,0,203,8,1,123,0,...,0,14624,14747,4,ELEC,60,11,101,31/12/1994,0.0
3,train_Client_0,2015-07-13,11,1335667,0,207,8,1,102,0,...,0,14747,14849,4,ELEC,60,11,101,31/12/1994,0.0
4,train_Client_0,2016-11-17,11,1335667,0,207,9,1,572,0,...,0,15066,15638,12,ELEC,60,11,101,31/12/1994,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4476744,train_Client_99998,2005-08-19,10,1253571,0,202,9,1,400,135,...,0,3197,3732,8,ELEC,60,11,101,22/12/1993,0.0
4476745,train_Client_99998,2005-12-19,10,1253571,0,202,6,1,200,6,...,0,3732,3938,4,ELEC,60,11,101,22/12/1993,0.0
4476746,train_Client_99999,1996-09-25,11,560948,0,203,6,1,259,0,...,0,13884,14143,4,ELEC,60,11,101,18/02/1986,0.0
4476747,train_Client_99999,1996-05-28,11,560948,0,203,6,1,603,0,...,0,13281,13884,4,ELEC,60,11,101,18/02/1986,0.0


In [8]:
del df_invoice_train, df_invoice_test, df_client_train, df_client_test, df_SampleSubmission

### Renaming Columns and setting Data Types

There are some comments from the source regarding the meaning of these columns. These are included here verbatim.
* "client_train.csv":
  * "disrict: District where the client is"
  * "client_id: Unique id for client"
  * "client_catg: Category client belongs to"
  * "region: Area where the client is"
  * "creation_date: Date client joined"
  * "target: fraud:1, not fraud: 0"
* "invoice_train.csv":
  * "client_id: Unique id for client"
  * "invoice_date: Date of the invoice"
  * "tarif_type: Type of tax"
  * "counter_number: number"
  * "counter_statue: akes up to 5 values such as working fine, not working, on hold statue, ect"
  * "counter_code: code"
  * "reading_remarque: notes that the STEG agent takes during his visit to the cleint (e.g.: if the counter shows something wrong, the"
  * "counter_coefficient: An additional coefficient to be added when standard consumption is exceeded"
  * "consommation_level_1: Consumption_level_1"
  * "consommation_level_2: Consumption_level_2"
  * "consommation_level_3: Consumption_level_3"
  * "consommation_level_4: Consumption_level_4"
  * "old_index: Old index"
  * "new_index: New index"
  * "months_number: Month number"
  * "counter_type: Type of counter"

Going to rename the columns slightly to better match my interpretation of the data gained from inspection.

In [9]:
rename_dict = {'client_id' : 'client_id', 
               'invoice_date' : 'invoice_date', 
               'tarif_type' : 'mtr_tariff', 
               'counter_number' : 'mtr_id',
               'counter_statue' : 'mtr_status', 
               'counter_code' : 'mtr_code', 
               'reading_remarque' : 'mtr_notes',
               'counter_coefficient' : 'mtr_coef', 
               'consommation_level_1' : 'usage_1', 
               'consommation_level_2' : 'usage_2',
               'consommation_level_3' : 'usage_3', 
               'consommation_level_4' : 'usage_4', 
               'old_index' : 'mtr_val_old',
               'new_index' : 'mtr_val_new', 
               'months_number': 'months_num', 
               'counter_type' : 'mtr_type', 
               'disrict' : 'district', 
               'client_catg' : 'client_type',
               'region' : 'region', 
               'creation_date' : 'start_date', 
               'target' : 'fraud'}

df_test.rename(columns=rename_dict, inplace=True) # Note that test won't have target
df_train.rename(columns=rename_dict, inplace=True)

Converting data types where appropriate:

In [None]:
# invoice_date: object -> date [YYYY-MM-DD] -> [YYYY-MM-DD]
df_train['invoice_date'] = pd.to_datetime(df_train['invoice_date'])
# start_date: object -> date [DD/MM/YYYY] -> [YYYY-MM-DD]
df_train['start_date'] = pd.to_datetime(df_train['start_date'], dayfirst=True)

col_names = ['mtr_type', 'district', 'client_type', 'region']
df_train[col_names] = df_train[col_names].astype("category")

# Converting 'mtr_val_' into float for now, as decimal places get in the way otherwise
df_train['mtr_val_old'] = df_train['mtr_val_old'].astype(float).round(0)
df_train['mtr_val_new'] = df_train['mtr_val_new'].astype(float).round(0)
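
As a small illustration of why the day-first convention matters for 'start_date', the sketch below parses the same strings under both readings. The sample strings are taken from the table previews above; the explicit `format` arguments are an assumption used here only to make the two interpretations unambiguous.

```python
import pandas as pd

# Two 'creation_date' strings as they appear in the raw client tables
raw = pd.Series(["06/08/2009", "31/12/1994"])

# Day-first reading (the convention assumed for 'start_date')
parsed_dayfirst = pd.to_datetime(raw, format="%d/%m/%Y")

# Month-first reading of the first string only; the second string
# has no valid month-first interpretation (month 31 does not exist)
parsed_monthfirst = pd.to_datetime(raw.iloc[:1], format="%m/%d/%Y")

print(parsed_dayfirst.tolist())
print(parsed_monthfirst.tolist())
```

Any date whose day is 12 or lower parses successfully under both readings, which is why a wrong convention can go unnoticed.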

### Data Structure and Hierarchy
Please refer to main notebook for the context.

In [11]:
# This will change depending on context, but is the generic grouping to order rows.
grp_cols = ['client_id', 'mtr_type', 'mtr_tariff', 'mtr_id'] # Then to be sorted by: 'invoice_date'

Based on data mining, some specific relationships have been deduced:
* usage_1 -> usage_4 are sequential "buckets" of usage, that if capped, can "spill" over into the next level.
* It is unclear how the presence of a cap is determined:
  * If mtr_tariff == [10|11] & mtr_code != [3__], more than usage_1 may be used.
  * If usage_3 is being used, so will usage_4.
  * usage_3 cap is equal to usage_1 cap. usage_2 cap is equal to half the usage_1 cap. usage_4 is uncapped.
  * Order of use is sequential; i.e., cannot use usage_2 before usage_1, etc.
  * If monthly cap == [50 | 300], usage_3 & usage_4 are not used.
* It is unclear how the quantity of the cap is determined:
  * The mtr_tariff and mtr_code seem to have an influence. mtr_status also seems to be influential.
  * If mtr_tariff == [10] & mtr_code == [1__], monthly cap is typically 50 | 100.
  * If mtr_tariff == [10] & mtr_code == [2__], monthly cap is typically 50 | 200.
  * If mtr_tariff == [11], monthly cap is typically 200 | 300.
* The "monthly cap" would be multiplied by the "months_num" to give the cap. This applies most of the time but cannot be relied upon.
* It is being assumed that there is an allowance of energy at different rates, provided on a monthly basis, thus requiring the scaling.
* mtr_val_old and mtr_val_new are meant to track the meter reading:
  * mtr_val_old will be prior to adding the invoice's usage, and mtr_val_new will be after.
  * The usage ("usage_n") is the sum of usage_1 -> usage_4, all divided by 'mtr_coef'.
  * This applies most of the time.
* On 'mtr_coef':
  * If mtr_coef > 1, and [usage_3|usage_4] > 0, monthly cap seems to be 200.
  * If mtr_coef > 1, and [usage_3|usage_4] > 0, 'mtr_code' == [483|5__].

Conclusions based on the above:
* A current invoice's mtr_val_old should match previous invoice's mtr_val_new if there was no gap in between. It can be more if there is a gap. It should not be less.
* Similarly, the months_num should equal the difference in invoice dates in months (if there were no gaps). It can be less if there is a gap. It should not be more.
* The mtr_val_old plus the usage should equal mtr_val_new.
* The delta of the mtr_val_new of the current invoice and previous invoice should be energy used during the delta of the two invoice dates.
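
The "bucket" behaviour described above can be sketched as follows. `split_usage` is a hypothetical helper, not part of the notebook, and it only models the general case: usage_1 and usage_3 capped at the monthly cap times 'months_num', usage_2 at half of that, and usage_4 uncapped. The exception noted for monthly caps of 50 or 300 is deliberately not modelled, and the input values are illustrative.

```python
def split_usage(total, monthly_cap, months_num):
    """Split a total usage into four sequential buckets (hypothetical helper).

    Assumes the caps deduced above: usage_1 and usage_3 are capped at
    monthly_cap * months_num, usage_2 at half of that, usage_4 is uncapped.
    """
    cap = monthly_cap * months_num
    caps = [cap, cap // 2, cap]
    buckets = []
    remaining = total
    for bucket_cap in caps:
        used = min(remaining, bucket_cap)  # fill this bucket up to its cap
        buckets.append(used)
        remaining -= used
    buckets.append(remaining)  # usage_4 takes whatever is left
    return buckets

# Usage that only spills into the second bucket
small = split_usage(total=1384, monthly_cap=300, months_num=4)
# Usage that spills all the way into the uncapped fourth bucket
large = split_usage(total=2310, monthly_cap=200, months_num=1)
print(small, large)
```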

### Data Quality Markers

Going to use some helper functions that identify data inconsistencies as compared to expectations.
#### Intra-Row: Energy Usage
* 'usage_N' is sum of the 'usage_{1|2|3|4}' columns.
* 'usage_N' / 'mtr_coef' = 'usage_n' which represents, presumably, the amount being paid for.
* 'mtr_val_old' + 'usage_n' = 'mtr_val_new'.

'Calc_Usage()' calculates these metrics and derives them via different possible permutations.

In [12]:
def Calc_Usage(df_train):
    rows = len(df_train)
    df_train['usage_N'] = df_train.loc[:, ['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1)
    df_train['usage_n'] = (df_train['usage_N'] / df_train.loc[:, 'mtr_coef']).round(0)
    # Calculating permutations of this:
    df_train['usage_n_calc'] = df_train['mtr_val_new'] - df_train['mtr_val_old']
    df_train['mtr_coef_calc'] = (df_train['usage_N'] / df_train['usage_n_calc']).round(1)
    df_train['mtr_val_new_calc'] = df_train['mtr_val_old'] + df_train['usage_n'].round(0)
    df_train['mtr_val_old_calc'] = df_train['mtr_val_new'] - df_train['usage_n'].round(0)
    mask = df_train['usage_n_calc'] != df_train['usage_n']
    print(f"Rows with unexpected usage values / Total Rows: {sum(mask)} / {rows}.")
    df_train['usage_flag'] = mask.astype(int)
    return df_train.copy()
df_train = Calc_Usage(df_train.copy())

Rows with unexpected usage values / Total Rows: 17533 / 4476749.
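
As a worked instance of the intra-row rules, the sketch below repeats the arithmetic of 'Calc_Usage()' for a single invoice using plain Python. The numbers are illustrative and assume a row with 'mtr_coef' of 1.5.

```python
# One invoice, written out by hand (illustrative values)
usage = [1200, 3024, 0, 0]           # usage_1 .. usage_4
mtr_coef = 1.5
mtr_val_old, mtr_val_new = 495, 3311

usage_N = sum(usage)                      # raw total across the four buckets
usage_n = round(usage_N / mtr_coef)       # total scaled by the coefficient
usage_n_calc = mtr_val_new - mtr_val_old  # what the meter readings imply

# The row is flagged only when the two estimates disagree
usage_flag = int(usage_n_calc != usage_n)
print(usage_N, usage_n, usage_n_calc, usage_flag)
```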


#### Inter-Row: Meter and Date Consistency
* 'mtr_val_{old|new}' should monotonically increase over time. 
* 'invoice_date' - 'months_num' >= Prev('invoice_date') and should strictly increase over time.

'Calc_Neighbours()' calculates these relationships. Since 'months_num' is an integer and vague in its definition, a one-month leeway is added to the calculations.

In [13]:
def Calc_Neighbours(df_train, grp_cols=grp_cols):
    rows = len(df_train)
    # Sort, Group, Find neighbours
    df_train.sort_values(by=grp_cols+['invoice_date', 'mtr_code', 'usage_n'], inplace=True)
    df_grp = df_train.groupby(grp_cols, observed=True)
    df_train['mtr_val_new_prv_2'] = df_grp['mtr_val_new'].shift(2)
    col_names = ['mtr_val_old', 'mtr_val_new', 'invoice_date', 'months_num']
    df_train[[col_name+'_prv' for col_name in col_names]] = (df_grp[col_names].shift(1))
    df_train[[col_name+'_nxt' for col_name in col_names[:2]]] = (df_grp[col_names[:2]].shift(-1))
    df_train['mtr_val_old_nxt_2'] = df_grp['mtr_val_old'].shift(-2)
    # Calculate expected metrics based on relationships
    df_train['invoice_date_prv_calc'] = pd.NaT # If months_num is unreasonable it would crash
    mask = (df_train['months_num'] > 0) & (df_train['months_num'] < 600) # 0 - 50 Years
    df_train['invoice_date_prv_calc'] = df_train.loc[mask, 'invoice_date'] - pd.to_timedelta(df_train.loc[mask, 'months_num'].astype(int) * 30.5, unit='days')
    df_train['months_num_calc'] = np.ceil(((df_train['invoice_date'] - df_train['invoice_date_prv']).dt.days.values / 30.5))
    # Check Rules
    mask = ((df_train['mtr_val_old'] < df_train['mtr_val_new_prv']) & (df_train['mtr_val_new_prv'].notna()))
    df_train['mtr_flag_bkd'] = mask.astype(int)
    print(f"Rows where meter readings seem out of order looking backwards / Total Rows: {sum(mask)} / {rows}.")
    mask = ((df_train['mtr_val_new'] > df_train['mtr_val_old_nxt']) & (df_train['mtr_val_old_nxt'].notna()))
    df_train['mtr_flag_fwd'] = mask.astype(int)
    print(f"Rows where meter readings seem out of order looking forwards / Total Rows: {sum(mask)} / {rows}.")
    mask = ((df_train['invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') < df_train['invoice_date_prv']) & (df_train['invoice_date_prv_calc'].notna()))
    df_train['date_flag'] = mask.astype(int)
    print(f"Rows where invoice spans seem to overlap based on \'months_num\' / Total Rows: {sum(mask)} / {rows}.")
    mask = ((df_train['months_num_calc'] < df_train['months_num']) & (df_train['invoice_date_prv_calc'].notna()))
    df_train.loc[mask, 'date_flag'] = 1
    print(f"Rows where invoice spans seem to overlap based on \'invoice_date\'s / Total Rows: {sum(mask)} / {rows}.")
    return df_train.copy()

In [14]:
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows where meter readings seem out of order looking backwards / Total Rows: 597728 / 4476749.
Rows where meter readings seem out of order looking forwards / Total Rows: 597728 / 4476749.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 1014829 / 4476749.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 1014873 / 4476749.
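
The core of the neighbour comparison is a grouped shift. The sketch below shows the backwards meter check on a toy frame; the data and the single grouping column are assumptions made for brevity, whereas the notebook groups on all of 'grp_cols'.

```python
import pandas as pd

# Toy invoices for two meters (hypothetical values)
toy = pd.DataFrame({
    "mtr_id": [1, 1, 1, 2, 2],
    "invoice_date": pd.to_datetime(
        ["2015-01-10", "2015-05-10", "2015-09-10", "2015-02-01", "2015-06-01"]
    ),
    "mtr_val_old": [0, 100, 90, 500, 650],
    "mtr_val_new": [100, 250, 300, 650, 700],
})

# Sort within each meter, then fetch the previous invoice's closing reading
toy = toy.sort_values(["mtr_id", "invoice_date"])
toy["mtr_val_new_prv"] = toy.groupby("mtr_id")["mtr_val_new"].shift(1)

# Flag rows whose opening reading is below the previous closing reading;
# the first invoice of each meter has no predecessor and is never flagged
toy["mtr_flag_bkd"] = (
    (toy["mtr_val_old"] < toy["mtr_val_new_prv"]) & toy["mtr_val_new_prv"].notna()
).astype(int)
print(toy)
```

Only the third invoice of meter 1 is flagged, since it opens at 90 after the previous invoice closed at 250.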


### Data Corruption

#### Column Misalignment
##### Focusing on 'mtr_status'
'mtr_status' had some suspiciously rare values that were manually inspected: [3, 2, 46, A, 618, 769, 269375, 420].

Of these, the values ['46', '618', '769', '269375', '420'] seemed symptomatic of data quality issues.

In [15]:
mask = df_train['mtr_status'].isin(['46', '618', '769', '269375', '420']) # Manual check showed 'A' seems acceptable
print(f'number of bad mtr_status: {sum(mask)}.')
col_names = ['mtr_tariff', 'mtr_id', 'mtr_status', 'mtr_code', 'mtr_notes', 'mtr_coef', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'mtr_val_old', 'mtr_val_new', 'months_num']
pd.concat([df_train[~mask].head(5), df_train[mask].head(5)])[col_names] # Comparison

number of bad mtr_status: 34.


Unnamed: 0,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num
22,11,1335667,0,203,6,1,124,0,0,0,3685,3809,4
23,11,1335667,0,203,6,1,141,0,0,0,3809,3950,4
24,11,1335667,0,203,6,1,162,0,0,0,3950,4112,4
25,11,1335667,0,203,6,1,159,0,0,0,4112,4271,4
28,11,1335667,0,203,6,1,182,0,0,0,4271,4453,4
1178214,11,170,769,0,207,6,1,332,0,0,0,0,332
1178211,11,170,769,0,207,6,1,385,0,0,0,332,717
1178200,11,170,769,0,207,6,1,479,0,0,0,717,1196
1178209,11,170,769,0,207,6,1,437,0,0,0,1196,1633
1178207,11,170,769,0,207,6,1,453,0,0,0,1633,2086


For these, it is assumed that the columns have been shifted by mistake. One key sign here is usage_1 == 1 and mtr_coef == 6 | 8 | 9.

It is not clear why. In particular, it is not clear what 'mtr_id' and/or 'mtr_status' is meant to represent here and which might be wrong. 'mtr_id' will be kept and 'mtr_status' will be deleted.

'months_num' will be empty, so it is filled with a placeholder value of -1 for now.

In [16]:
# Flag and modify
df_train['col_shift_flag'] = mask.astype(int)
df_train.loc[mask, df_train.columns[4:14]] = df_train.loc[mask, df_train.columns[5:15]].values
df_train.loc[mask, df_train.columns[14]] = -1
df_train = Calc_Usage(df_train.copy())
pd.concat([df_train[~mask].head(5), df_train[mask].head(5)])[col_names] # Comparison

Rows with unexpected usage values / Total Rows: 17502 / 4476749.


Unnamed: 0,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num
22,11,1335667,0,203,6,1,124,0,0,0,3685,3809,4
23,11,1335667,0,203,6,1,141,0,0,0,3809,3950,4
24,11,1335667,0,203,6,1,162,0,0,0,3950,4112,4
25,11,1335667,0,203,6,1,159,0,0,0,4112,4271,4
28,11,1335667,0,203,6,1,182,0,0,0,4271,4453,4
1178214,11,170,0,207,6,1,332,0,0,0,0,332,-1
1178211,11,170,0,207,6,1,385,0,0,0,332,717,-1
1178200,11,170,0,207,6,1,479,0,0,0,717,1196,-1
1178209,11,170,0,207,6,1,437,0,0,0,1196,1633,-1
1178207,11,170,0,207,6,1,453,0,0,0,1633,2086,-1
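
The realignment works by copying a block of columns one position to the left. The sketch below shows the same idea on a single-row toy frame with fewer columns than the real data; the column positions are therefore different from those used in the cell above.

```python
import pandas as pd

# One misaligned row, reduced to eight columns for illustration
toy = pd.DataFrame(
    [[11, 170, 769, 0, 207, 6, 1, 332]],
    columns=["mtr_tariff", "mtr_id", "mtr_status", "mtr_code",
             "mtr_notes", "mtr_coef", "usage_1", "months_num"],
)

bad = toy["mtr_status"] == 769
cols = toy.columns

# Overwrite columns 2..6 with the values of columns 3..7.
# .values drops the column labels so pandas assigns by position
# instead of aligning on names.
toy.loc[bad, cols[2:7]] = toy.loc[bad, cols[3:8]].values
# The last column has nothing left to receive, so mark it as unknown
toy.loc[bad, cols[7]] = -1
print(toy)
```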


##### Focusing on 'mtr_coef'
One identified issue is assumed to be caused by 'mtr_coef' having had a decimal part written with a comma, which caused an offset in columns. This was originally deduced from suspicious 'months_num' and 'usage_2' values.

For example, "1,5" would cause the "5" to fall into 'usage_1', etc. 'months_num' was then overwritten and lost.

'mtr_coef' would therefore need to be reconstructed for these examples, and the columns shifted back.

The lost 'months_num' values are for now given a placeholder value of -1.
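
The assumed mechanism can be reproduced with a minimal sketch: an unquoted decimal comma adds one field to a comma-separated line. The four-column layout below is hypothetical and much narrower than the real invoice file.

```python
import csv
import io

# Hypothetical layout: mtr_coef, usage_1, usage_2, months_num
good_row = "1,229,0,4"
bad_row = "1,5,229,0,4"  # mtr_coef written as "1,5" without quoting

good_fields = next(csv.reader(io.StringIO(good_row)))
bad_fields = next(csv.reader(io.StringIO(bad_row)))

print(len(good_fields), good_fields)
print(len(bad_fields), bad_fields)
```

In the bad row the "5" lands where 'usage_1' is expected and every later value moves one column to the right, so a fixed-width table has no room left for the final field.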

Rows have to be filtered carefully to identify only those related to the described data corruption problem:
* If(0 < 'usage_1' < 10), then flag.
  * On the assumption it is single decimal place, it must be a single digit.
  * All other conditions are applied after this.
* If('usage_2' > 0) or If('months_num' > 240), then flag.
  * The minimum monthly cap seen is 50, so 'usage_2' should not be used prior to that.
  * This would not apply if only true 'usage_1' was used, so is insufficient.
  * 'months_num' would not reasonably exceed 20 years.
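
When these rules are combined into one boolean mask, the grouping matters because `&` binds more tightly than `|` in Python. The sketch below compares the rule set as listed (the single-digit test applied first) with an ungrouped expression, on a hypothetical toy frame.

```python
import pandas as pd

toy = pd.DataFrame({
    "usage_1":    [5, 5, 150, 150],
    "usage_2":    [1200, 0, 0, 0],
    "months_num": [4, 4, 4, 3311],
})

single_digit = toy["usage_1"].between(1, 9)  # 0 < usage_1 < 10
spilled = toy["usage_2"] > 0
bad_months = toy["months_num"] > 240

# Rule set as listed: the single-digit test gates the other two
as_listed = single_digit & (spilled | bad_months)
# Without the inner parentheses this is (single_digit & spilled) | bad_months
ungrouped = single_digit & spilled | bad_months

print(as_listed.tolist())
print(ungrouped.tolist())
```

The two masks differ on the last row, which has an implausible 'months_num' but an ordinary 'usage_1'.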

In [17]:
# Finding those affected
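# Note: `&` binds tighter than `|`, so this selects rows where
# (usage_1 is a single digit AND usage_2 != 0) OR months_num > 240.
mask = (((df_train['usage_1'] > 0) & (df_train['usage_1'] < 10) & (df_train['usage_2'] != 0))
        | (df_train['months_num'] > 240))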
df_temp = df_train[mask]
col_names = ['mtr_coef', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'mtr_val_old', 'mtr_val_new', 'months_num', 'usage_N', 'usage_n', 'usage_n_calc']
pd.concat([df_train[~mask].head(5), df_train[mask].head(5)])[col_names] # Comparison

Unnamed: 0,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,usage_N,usage_n,usage_n_calc
22,1,124,0,0,0,3685,3809,4,124,124.0,124
23,1,141,0,0,0,3809,3950,4,141,141.0,141
24,1,162,0,0,0,3950,4112,4,162,162.0,162
25,1,159,0,0,0,4112,4271,4,159,159.0,159
28,1,182,0,0,0,4271,4453,4,182,182.0,182
20214,1,5,1200,3024,0,0,495,3311,4229,4229.0,495
20219,1,5,229,0,0,0,342,495,234,234.0,342
20212,1,5,1200,10566,0,0,9971,17815,11771,11771.0,9971
20213,1,5,1200,8790,0,0,3311,9971,9995,9995.0,3311
20211,1,5,1200,10744,0,0,17815,25778,11949,11949.0,17815


In [18]:
# Enact the change, and keep record
df_train.loc[mask, 'col_shift_flag'] = 1
df_train['unk_months_num'] = mask.astype(int)
df_train['mtr_coef'] = df_train['mtr_coef'].astype(float)
df_train.loc[mask, 'mtr_coef'] = df_train.loc[mask, 'mtr_coef'] + (df_train.loc[mask, 'usage_1'].astype(float) / 10).round(1)
df_train.loc[mask, df_train.columns[8:14]] = df_train.loc[mask, df_train.columns[9:15]].values
df_train.loc[mask, df_train.columns[14]] = -1
print(f"Number of Rows where Columns seem offset based on ruleset / Total Rows: {sum(mask)} / {len(df_train)}.")
df_train = Calc_Usage(df_train.copy())
df_train[mask][col_names]

Number of Rows where Columns seem offset based on ruleset / Total Rows: 1392 / 4476749.
Rows with unexpected usage values / Total Rows: 16436 / 4476749.


Unnamed: 0,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,usage_N,usage_n,usage_n_calc
20214,1.5,1200,3024,0,0,495,3311,-1,4224,2816.0,2816
20219,1.5,229,0,0,0,342,495,-1,229,153.0,153
20212,1.5,1200,10566,0,0,9971,17815,-1,11766,7844.0,7844
20213,1.5,1200,8790,0,0,3311,9971,-1,9990,6660.0,6660
20211,1.5,1200,10744,0,0,17815,25778,-1,11944,7963.0,7963
...,...,...,...,...,...,...,...,...,...,...,...
4457223,1.5,200,100,200,1810,459733,461273,-1,2310,1540.0,1540
4457217,1.5,1000,407,0,0,465008,465946,-1,1407,938.0,938
4457218,1.5,200,100,200,239,463554,464047,-1,739,493.0,493
4457226,1.5,200,100,200,1321,458519,459733,-1,1821,1214.0,1214


#### Poor Date Parsing
Another identified issue is assumed to be caused by the original uploader of the dataset to Kaggle relying on automated date parsing, resulting in days and months being switched in places. This was deduced by considering the 'invoice_date', 'mtr_val_old', 'mtr_val_new', and 'months_num' fields. The significance is that 'months_num' is calculated from the true 'invoice_date' and not the 'invoice_date' recorded here.

Using the data hierarchy mentioned before, it would broadly be expected for 'mtr_val_old' and 'mtr_val_new' to increase monotonically.

In [19]:
# Finding those affected
mask = (df_train['invoice_date'].dt.day <= 12) & (df_train['invoice_date'].dt.month <= 12) # Month is always <= 12, so only the day test filters
print(f"Rows where meter readings day and month could have been switched / Total Rows: {sum(mask)} / {len(df_train)}.") # ~44% of rows
print(f"Rows where meter readings seem out of order / Rows where day and month could have been switched: {sum(((df_train['mtr_flag_bkd'] == 1) | (df_train['mtr_flag_fwd'] == 1)) & mask)} / {sum(mask)}.") # ~51% of those

Rows where meter readings day and month could have been switched / Total Rows: 1976635 / 4476749.
Rows where meter readings seem out of order / Rows where day and month could have been switched: 998863 / 1976635.


In [20]:
# Flip the day and month for these cases, and keep record. There might be a vectorised way for this to speed this up...
df_train['date_flip_flag'] = mask.astype(int)
df_train.loc[mask, 'invoice_date'] = df_train.loc[mask, 'invoice_date'].apply(lambda x: x.replace(day=x.month, month=x.day) if pd.notna(x) else x)
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows where meter readings seem out of order looking backwards / Total Rows: 11899 / 4476749.
Rows where meter readings seem out of order looking forwards / Total Rows: 11899 / 4476749.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 4480 / 4476749.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 4525 / 4476749.


Given how significantly the data quality metrics improved, it is unlikely to be due to chance. However, there is still a chance that some dates did not need switching, for which nothing is done, unfortunately.
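
One possible vectorised alternative to the row-wise `apply` is to rebuild the dates from their components with `pd.to_datetime`, which accepts a frame with 'year', 'month' and 'day' columns. This is a sketch on toy data, not a drop-in replacement that has been timed on the full dataset.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["2014-03-07", "2014-03-24", None]))

# Only days 1..12 can be reinterpreted as a month; NaT compares as False
swappable = dates.dt.day <= 12

# Reassemble the swappable dates with day and month exchanged
parts = pd.DataFrame({
    "year": dates[swappable].dt.year,
    "month": dates[swappable].dt.day,
    "day": dates[swappable].dt.month,
})
swapped = dates.copy()
swapped.loc[swappable] = pd.to_datetime(parts)
print(swapped)
```

Because both the original day and month are at most 12 for the selected rows, the swapped date is always valid.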

### Data Coherence

#### "Empty" Rows
Some rows are essentially empty: they have neither recorded usage nor meter values, or similar. These mainly seem out of place and do not provide much value. They will be removed unless doing so would leave no row for a given 'client_id' and 'mtr_type'.

'Remove_Rows()' function will handle this task.

In [21]:
def Remove_Rows(df_train, mask, reason, df_removed=None):
    df_train['temp_flag'] = mask
    # Must have at least one record kept for the group
    mask = mask & (df_train.groupby(['client_id', 'mtr_type'], observed=True)['temp_flag'].transform('min') == 0)
    df_temp = df_train[mask].copy()
    df_temp['removed'] = reason
    if df_removed is not None:
        df_removed = pd.concat([df_removed, df_temp.copy()], sort=False)
    else:
        df_removed = df_temp.copy()
    print(f"Rows to be removed / out of rows requested to be removed: {len(df_temp)} / {sum(df_train['temp_flag'])}.")
    df_train = df_train[~mask]
    return (df_train.copy(), df_removed.copy(), df_temp)
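
The "keep at least one record" guard relies on a grouped minimum of the removal flag: if the minimum within a group is 0, at least one row in that group is not flagged and the flagged ones can go. A minimal sketch on hypothetical data, grouping on 'client_id' only:

```python
import pandas as pd

toy = pd.DataFrame({
    "client_id": ["a", "a", "b", "b"],
    "flagged":   [1, 0, 1, 1],  # 1 = requested for removal
})

# A group minimum of 0 means the group has at least one unflagged row
group_has_keeper = toy.groupby("client_id")["flagged"].transform("min") == 0
to_remove = (toy["flagged"] == 1) & group_has_keeper
print(to_remove.tolist())
```

Client "b" has every row flagged, so none of its rows are removed; client "a" loses only its flagged row.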

##### Focusing on 'mtr_coef'
'mtr_coef' of 0 is rare, and any energy usage in that row does not transfer onwards. These rows primarily have the equally rare 'mtr_tariff' of 8. It was checked that these were unrelated to the previous 'mtr_coef' issue, i.e., they were not 0.6 etc.

In [22]:
mask = df_train['mtr_coef'] == 0 # Manual check showed these do not contribute to energy usage over time. They reset each row.
df_train['bad_coef'] = mask.astype(int)
df_train, df_removed, df_temp = Remove_Rows(df_train, mask, 'mtr_empty')
df_temp[col_names].head()

Rows to be removed / out of rows requested to be removed: 45 / 45.


Unnamed: 0,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,usage_N,usage_n,usage_n_calc
179746,0.0,0,0,0,0,0,87,4,0,,87
179751,0.0,0,0,0,0,0,0,4,0,,0
179755,0.0,0,0,0,0,0,633,4,0,,633
997145,0.0,0,0,0,0,0,0,1,0,,0
1228208,0.0,0,0,0,0,0,0,4,0,,0


##### Focusing on ['usage_{1|2|3|4}', 'mtr_val_{old|new}']: "Empty" Meter
There are some rows where 'usage_N' == 'mtr_val_old' == 'mtr_val_new' == 0.

There are also some rows where 'usage_N' == 0 and 'mtr_val_old' == 'mtr_val_new' != 0. These will not be removed, but will be treated with lower priority if in conflict with another row. This is done by sorting on 'usage_n' such that, all else being equal, they appear before a row with 'usage_n' > 0 and so are enveloped by it.

In [23]:
mask = (df_train[['usage_N', 'mtr_val_old', 'mtr_val_new']] == 0).all(axis=1) # These are considered empty
df_train, df_removed, df_temp = Remove_Rows(df_train, mask, 'mtr_empty', df_removed)
df_temp[col_names].head()
df_train = Calc_Usage(df_train.copy())
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 103849 / 195401.
Rows with unexpected usage values / Total Rows: 16391 / 4372855.
Rows where meter readings seem out of order looking backwards / Total Rows: 11347 / 4372855.
Rows where meter readings seem out of order looking forwards / Total Rows: 11347 / 4372855.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 3720 / 4372855.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 3749 / 4372855.


#### Intra-Row: Mis-matched Usage
##### "Stuck" Meter:
There seems to be an issue with the 'mtr_val_{old|new}' not reflecting 'usage_n'. The cause is not clear; it is assumed to be some sort of stuck meter.
These will be changed if one of two checks is met:
  1. Recalculating 'mtr_val_new' using 'mtr_val_old' and 'usage_n' aligns with the next row.
  2. Recalculating 'mtr_val_old' using 'mtr_val_new' and 'usage_n' aligns with the previous row.

The checks are done sequentially. This is not ideal as some cases will be missed.

In [24]:
# Checking with next row:
mask = ((df_train['mtr_val_old'] == df_train['mtr_val_new']) & (df_train['usage_n'] > 0)) # Meter not changing despite usage
mask_2 = (df_train['mtr_val_old_nxt'] != df_train['mtr_val_new_calc']) | (df_train['mtr_val_old_nxt'].isna()) # Don't trust adjustment
df_train['mtr_stuck'] = 0
df_train.loc[mask & mask_2, 'mtr_stuck'] = -1 # Those that wouldn't be fixed via adjustment
df_train.loc[mask & ~mask_2, 'mtr_stuck'] = 1 # Those being adjusted
df_train.loc[mask & ~mask_2, 'mtr_val_new'] = df_train.loc[mask & ~mask_2, 'mtr_val_new_calc'] # Replacing
print(f"Looking at Next Row: Number of Rows detected: {sum(mask)}.")
print(f"Looking at Next Row: Number of Rows deemed unfixable: {sum(mask & mask_2)}.")
print(f"Looking at Next Row: Number of Rows deemed fixable: {sum(mask & ~mask_2)}.")

# Checking with previous row:
mask = ((df_train['mtr_val_old'] == df_train['mtr_val_new']) & (df_train['usage_n'] > 0)) # Meter not changing despite usage
mask_2 = (df_train['mtr_val_new_prv'] != df_train['mtr_val_old_calc']) | (df_train['mtr_val_new_prv'].isna()) # Don't trust adjustment
df_train.loc[mask & ~mask_2, 'mtr_stuck'] = 1 # Those being adjusted
df_train.loc[mask & ~mask_2, 'mtr_val_old'] = df_train.loc[mask & ~mask_2, 'mtr_val_old_calc'] # Replacing
df_train.loc[(df_train['mtr_stuck'] != 1) & (mask & mask_2), 'mtr_stuck'] = -1 # Those that wouldn't be fixed via adjustment
print(f"Looking at Previous Row: Number of Rows detected: {sum(mask)}.")
print(f"Looking at Previous Row: Number of Rows deemed unfixable: {sum(mask & mask_2)}.")
print(f"Looking at Previous Row: Number of Rows deemed fixable: {sum(mask & ~mask_2)}.")

# Re-sort and check for the same rule again
df_train = Calc_Usage(df_train.copy())
df_train = Calc_Neighbours(df_train.copy(), grp_cols)
df_temp = df_train[df_train['mtr_stuck'] == -1]
col_names = ['mtr_coef', 'usage_N', 'usage_n', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'months_num', 'mtr_val_new_calc']
df_temp[col_names].head()

Looking at Next Row: Number of Rows detected: 7281.
Looking at Next Row: Number of Rows deemed unfixable: 4621.
Looking at Next Row: Number of Rows deemed fixable: 2660.
Looking at Previous Row: Number of Rows detected: 4621.
Looking at Previous Row: Number of Rows deemed unfixable: 4611.
Looking at Previous Row: Number of Rows deemed fixable: 10.
Rows with unexpected usage values / Total Rows: 13721 / 4372855.
Rows where meter readings seem out of order looking backwards / Total Rows: 11348 / 4372855.
Rows where meter readings seem out of order looking forwards / Total Rows: 11348 / 4372855.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 3720 / 4372855.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 3749 / 4372855.


Unnamed: 0,mtr_coef,usage_N,usage_n,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,months_num,mtr_val_new_calc
1865,1.0,2943,2943.0,7709.0,7709,7709,7709.0,8,10652.0
2865,1.0,5000,5000.0,3143.0,3143,3143,3143.0,18,8143.0
3004,1.0,204,204.0,1469.0,1469,1469,1469.0,12,1673.0
6864,1.0,20,20.0,157.0,157,157,157.0,4,177.0
6918,1.0,44,44.0,15678.0,15702,15702,15810.0,2,15746.0
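
The forward-looking check can be illustrated on a toy meter history. The sketch assumes a single meter that is already sorted by date, so no grouping is needed; the values are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "mtr_val_old": [100, 250, 300],
    "mtr_val_new": [250, 250, 400],  # middle row: meter did not move
    "usage_n":     [150, 50, 100],
})

toy["mtr_val_old_nxt"] = toy["mtr_val_old"].shift(-1)
toy["mtr_val_new_calc"] = toy["mtr_val_old"] + toy["usage_n"]

# Stuck: the reading is unchanged although usage was recorded
stuck = (toy["mtr_val_old"] == toy["mtr_val_new"]) & (toy["usage_n"] > 0)
# Trust the recalculated reading only if the next invoice opens on it
fixable = stuck & (toy["mtr_val_old_nxt"] == toy["mtr_val_new_calc"])

toy.loc[fixable, "mtr_val_new"] = toy.loc[fixable, "mtr_val_new_calc"]
print(toy[["mtr_val_old", "mtr_val_new", "usage_n"]])
```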


##### "Phantom" Usage
An almost reverse of the above problem is when there is no apparent usage (usage_N == 0), but 'mtr_val_old' != 'mtr_val_new'.

If these rows are in the expected order in terms of dates and meter values (looking forwards), these instances are fixed. This is done very simply by placing it all into 'usage_1'. Otherwise, these are left alone for now.

Often, it is the entire history of that 'mtr_tariff' & 'mtr_id' that has the issue. But sometimes, it is sporadic.

In [25]:
mask = (df_train['mtr_val_old'] != df_train['mtr_val_new']) & (df_train['usage_N'] == 0) # those with the missing usage
mask_2 = (df_train['date_flag'] == 0) & (df_train['mtr_flag_fwd'] == 0) 
df_train['usage_missing'] = (mask & mask_2).astype(int) # Those being adjusted
df_train.loc[(mask & ~mask_2), 'usage_missing'] = -1 # Not being adjusted
# df_train.loc[(mask & mask_2), 'usage_1'] = df_train.loc[(mask & mask_2), 'usage_n_calc'] * df_train.loc[(mask & mask_2), 'mtr_coef'] # Want unscaled value
print(f"Rows where meter changes whilst no usage is recorded, those adjusted / All possibly affected : {sum(mask & mask_2)} / {sum(mask)}.")
df_train = Calc_Usage(df_train.copy())
df_temp = df_train[(mask & ~mask_2)]
col_names = ['mtr_coef', 'usage_N', 'usage_n', 'usage_n_calc', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'months_num']
df_temp[col_names].head()

Rows where meter changes whilst no usage is recorded, those adjusted / All possibly affected : 2126 / 2154.
Rows with unexpected usage values / Total Rows: 13721 / 4372855.


Unnamed: 0,mtr_coef,usage_N,usage_n,usage_n_calc,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,months_num
129959,1.0,0,0.0,45,,0,45,0.0,2
240538,1.0,0,0.0,496,1192.0,1647,2143,1647.0,2
258541,1.0,0,0.0,7,,5194,5201,5194.0,2
806909,1.0,0,0.0,1037,,232,1269,152.0,4
829641,1.0,0,0.0,141,,55234,55375,55234.0,2


#### Inter-Row: Mis-matched Usage
##### "Reset" Meter (1 of 2):
It seems that sometimes the meter is "reset" to retroactively adjust values. A complication is that this somewhat overlaps with a separate issue of meter "roll over", when a reading exceeds the meter's maximum number of digits.

A reset shows up as a row breaking one or both of the following rules:
1. 'invoice_date' - 'months_num' >= Prev('invoice_date')
2. 'mtr_val_old' >= 'mtr_val_new_prv' (alternatively, 'mtr_val_new' <= 'mtr_val_old_nxt')

Not all rows breaking the first rule seem out of order; however, removing them does not introduce an issue either. 'months_num' seems quite coarse, so rounding has to be allowed for.

It will typically be the row prior to the one that is "reset" that will be removed to restore order.

This is split into two steps, in an attempt to disentangle it from the meter "roll over" issue.
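
The two rules above can be sketched on a toy meter history as follows. This is only an illustration: the 30.5-day month and the one-month rounding tolerance are assumptions, and the flag names 'rule1_broken' and 'rule2_broken' are hypothetical, not columns of this notebook.

```python
import pandas as pd

# One meter's invoices in date order; values are made up.
toy = pd.DataFrame({
    "invoice_date": pd.to_datetime(
        ["2013-01-10", "2013-05-10", "2013-07-10", "2013-11-10"]),
    "months_num": [4, 4, 4, 4],
    "mtr_val_old": [100, 200, 150, 260],
    "mtr_val_new": [200, 300, 260, 350],
})

prev_date = toy["invoice_date"].shift(1)  # Prev('invoice_date')
prev_new = toy["mtr_val_new"].shift(1)    # 'mtr_val_new_prv'

# Approximate start of each invoice span, using an assumed 30.5-day month.
start = toy["invoice_date"] - pd.to_timedelta(toy["months_num"] * 30.5, unit="D")

# Rule 1 broken: the span starts before the previous invoice date,
# even after allowing roughly one month for rounding in 'months_num'.
tolerance = pd.Timedelta(days=30.5)
toy["rule1_broken"] = (start + tolerance) < prev_date

# Rule 2 broken: the old reading is below the previous new reading.
toy["rule2_broken"] = toy["mtr_val_old"] < prev_new
```

In this toy history the third row breaks both rules, so the row before it would be the candidate for removal.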

Looking at those breaking both rules.

In [26]:
mask = (df_train['date_flag'] == 1)
mask_2 = (df_train['mtr_flag_bkd'] == 1) | (df_train['mtr_flag_fwd'] == 1)
print(f"Rows where meter readings also seem to be reset / Rows that seem out of order: {sum(mask & mask_2)} / {sum(mask)}.")
mask = mask & mask_2
df_train['temp_flag'] = pd.to_datetime(np.nan)
df_train.loc[mask, 'temp_flag'] = df_train.loc[mask, 'invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') # This is the date being sought after
df_train['temp_flag'] = df_train.groupby(grp_cols, observed=True)['temp_flag'].transform('shift', periods=-1) # Flag row before the newly overlapping row
df_train['temp_flag'] = df_train.groupby(grp_cols, observed=True)['temp_flag'].bfill() # Flag all previous within the group
df_train['temp_flag'] = (df_train['invoice_date'] > df_train['temp_flag']) & (df_train['temp_flag'].notna()) # Flag only those within the overlapping period
mask = df_train['temp_flag'] == True
print(f"Rows affected by reset meter readings / total Rows: {sum(mask)} / {len(df_train)}.")
df_temp = df_train[mask]
col_names = ['invoice_date_prv', 'invoice_date_prv_calc', 'months_num', 'invoice_date', 'mtr_val_old_prv', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'mtr_val_new_nxt']
df_temp[col_names].head()

Rows where meter readings also seem to be reset / Rows that seem out of order: 318 / 3749.
Rows affected by reset meter readings / total Rows: 447 / 4372855.


Unnamed: 0,invoice_date_prv,invoice_date_prv_calc,months_num,invoice_date,mtr_val_old_prv,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,mtr_val_new_nxt
1812,2013-11-14,2014-01-02,2,2014-03-04,1311.0,1571.0,1571,1796,1571.0,1571.0
9411,2007-12-17,2007-12-06,2,2008-02-05,0.0,23475.0,23475,23475,22348.0,23475.0
23100,2010-12-14,2011-02-08,4,2011-06-10,3377.0,3745.0,3745,3745,0.0,4471.0
26942,2007-09-26,2007-12-09,2,2008-02-08,11395.0,11397.0,11771,11771,11397.0,11771.0
57251,2009-01-05,2009-08-29,2,2009-10-29,6666.0,6666.0,6666,6666,6609.0,6609.0


In [27]:
# Remove those flagged
df_train['mtr_reset'] = mask.astype(int)
df_train, df_removed, df_temp = Remove_Rows(df_train.copy(), mask, 'mtr_reset', df_removed.copy())
df_temp[col_names].head()
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 447 / 447.
Rows where meter readings seem out of order looking backwards / Total Rows: 11104 / 4372408.
Rows where meter readings seem out of order looking forwards / Total Rows: 11104 / 4372408.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 3406 / 4372408.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 3435 / 4372408.


Looking at those breaking rule 1 but not rule 2:
* If removing the necessary rows prior leads to perfect alignment ('mtr_val_old' == 'mtr_val_new_prv'), do so
* If currently in order but not in perfect alignment:
    * If removing the necessary rows prior leads to being in order ('mtr_val_old' > 'mtr_val_new_prv'), do so

The logic here is to not worsen the state of alignment by the changes.
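
A minimal sketch of that "do not worsen alignment" check, on made-up data: count misaligned rows per meter before and after dropping the candidate rows, and only accept the removal for meters where the count goes down. The helper `count_misaligned` and the columns 'candidate' and 'remove' are hypothetical names used for illustration.

```python
import pandas as pd

# Two toy meters; 'candidate' marks rows proposed for removal.
toy = pd.DataFrame({
    "mtr_id": [1, 1, 1, 2, 2, 2],
    "mtr_val_old": [0, 100, 100, 0, 50, 80],
    "mtr_val_new": [100, 120, 150, 50, 80, 120],
    "candidate": [False, True, False, False, True, False],
})

def count_misaligned(df):
    """Rows per meter whose old reading differs from the previous new reading."""
    prev_new = df.groupby("mtr_id")["mtr_val_new"].shift(1)
    misaligned = prev_new.notna() & (df["mtr_val_old"] != prev_new)
    return misaligned.groupby(df["mtr_id"]).sum()

before = count_misaligned(toy)
after = count_misaligned(toy[~toy["candidate"]])

# Accept the removal only for meters where misalignment strictly decreases.
improved = after < before
toy["remove"] = toy["candidate"] & toy["mtr_id"].map(improved)
```

Meter 1 improves when its candidate row is dropped, so that row is removed; meter 2 was already aligned and would get worse, so its candidate row is kept.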

In [34]:
# Looking at those currently in perfect meter alignment
mask = (df_train['date_flag'] == 1) & (((df_train['mtr_val_new'] == df_train['mtr_val_old_nxt']) | (df_train['mtr_val_old_nxt'].notna())) |
                                       ((df_train['mtr_val_old'] == df_train['mtr_val_new_prv']) | (df_train['mtr_val_new_prv'].notna())))
df_train['temp_flag'] = pd.to_datetime(np.nan)
df_train.loc[mask, 'temp_flag'] = df_train.loc[mask, 'invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') # This is the date being sought after
df_train['temp_flag'] = df_train.groupby(grp_cols, observed=True)['temp_flag'].transform('shift', periods=-1) # Flag row before the newly overlapping row
df_train['temp_flag'] = df_train.groupby(grp_cols, observed=True)['temp_flag'].bfill() # Flag all previous within the group
df_train['temp_flag'] = (df_train['invoice_date'] > df_train['temp_flag']) & (df_train['temp_flag'].notna()) # Flag only those within the overlapping period

# Want to check whether removing these improves alignment.
df_temp = df_train[df_train['temp_flag'] == False].copy() # Comparison case
print("vvv Ignore outputs below! vvv")
df_temp = Calc_Neighbours(df_temp.copy(), grp_cols) # Update values
print("^^^ Ignore outputs above! ^^^")
df_temp['temp_flag_2'] = df_temp['mtr_val_old'] != df_temp['mtr_val_new_prv'] # Not perfect alignment
df_train['temp_flag_2'] = df_train['mtr_val_old'] != df_train['mtr_val_new_prv'] 

# Keep only the groups where removal reduces the number of rows with non-perfect alignment
mask = (df_temp.groupby(grp_cols, observed=True)['temp_flag_2'].agg('sum') < df_train.groupby(grp_cols, observed=True)['temp_flag_2'].agg('sum')).reset_index()
mask = mask[mask['temp_flag_2'] == True]
df_train['temp_flag_2'] = df_train.set_index(grp_cols).index.isin(mask.set_index(grp_cols).index) # Flagging the groups where it was improvement
mask = (df_train['temp_flag'] == True) & (df_train['temp_flag_2'] == True)
print(f"Number of rows with overlapping periods that would otherwise align if removed / out total overlapping rows: {sum(mask)} / {sum(df_train['temp_flag'] == 1)}.") # Double-quoted f-string so the nested single quotes also parse before Python 3.12

vvv Ignore outputs below! vvv
Rows where meter readings seem out of order looking backwards / Total Rows: 11070 / 4368964.
Rows where meter readings seem out of order looking forwards / Total Rows: 11070 / 4368964.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 0 / 4368964.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 30 / 4368964.
^^^ Ignore outputs above! ^^^
Number of rows with overlapping periods that would otherwise align if removed / out total overlapping rows: 162 / 3444.


In [35]:
# Remove those flagged
df_train.loc[mask, 'mtr_reset'] = 1
df_train, df_removed, df_temp = Remove_Rows(df_train.copy(), mask, 'mtr_reset', df_removed.copy())
df_temp[col_names].head()
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 162 / 162.
Rows where meter readings seem out of order looking backwards / Total Rows: 11070 / 4372246.
Rows where meter readings seem out of order looking forwards / Total Rows: 11070 / 4372246.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 3259 / 4372246.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 3288 / 4372246.


In [36]:
# Looking at those currently not in perfect meter alignment
mask =  (df_train['date_flag'] == 1) & (((df_train['mtr_val_new'] < df_train['mtr_val_old_nxt']) | (df_train['mtr_val_old_nxt'].notna())) |
                                       ((df_train['mtr_val_old'] > df_train['mtr_val_new_prv']) | (df_train['mtr_val_new_prv'].notna())))
df_train['temp_flag'] = pd.to_datetime(np.nan)
df_train.loc[mask, 'temp_flag'] = df_train.loc[mask, 'invoice_date_prv_calc'] + pd.to_timedelta(30.5, unit='days') # This is the date being sought after
df_train['temp_flag'] = df_train.groupby(grp_cols, observed=True)['temp_flag'].transform('shift', periods=-1) # Flag row before the newly overlapping row
df_train['temp_flag'] = df_train.groupby(grp_cols, observed=True)['temp_flag'].bfill() # Flag all previous within the group
df_train['temp_flag'] = (df_train['invoice_date'] > df_train['temp_flag']) & (df_train['temp_flag'].notna()) # Flag only those within the overlapping period

# Want to check whether removing these causes good alignment.
df_temp = df_train[df_train['temp_flag'] == False].copy() # Comparison case
print("vvv Ignore outputs below! vvv")
df_temp = Calc_Neighbours(df_temp.copy(), grp_cols) # Update values
print("^^^ Ignore outputs above! ^^^")
df_temp['temp_flag_2'] = df_temp['mtr_val_old'] < df_temp['mtr_val_new_prv']  # Not in order (previous test was for perfect alignment)
df_train['temp_flag_2'] = df_train['mtr_val_old'] > df_train['mtr_val_new_prv'] 

# Keep only the groups where removal reduces the number of misaligned rows
mask = (df_temp.groupby(grp_cols, observed=True)['temp_flag_2'].agg('sum') < df_train.groupby(grp_cols, observed=True)['temp_flag_2'].agg('sum')).reset_index()
mask = mask[mask['temp_flag_2'] == True]
df_train['temp_flag_2'] = df_train.set_index(grp_cols).index.isin(mask.set_index(grp_cols).index) # Flagging the groups where it was improvement
mask = (df_train['temp_flag'] == True) & (df_train['temp_flag_2'] == True)
print(f"Number of rows with overlapping periods that stay aligned if removed / out total overlapping rows: {sum(mask)} / {sum(df_train['temp_flag'] == 1)}.") # Double-quoted f-string so the nested single quotes also parse before Python 3.12

vvv Ignore outputs below! vvv
Rows where meter readings seem out of order looking backwards / Total Rows: 11070 / 4368964.
Rows where meter readings seem out of order looking forwards / Total Rows: 11070 / 4368964.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 0 / 4368964.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 30 / 4368964.
^^^ Ignore outputs above! ^^^
Number of rows with overlapping periods that stay aligned if removed / out total overlapping rows: 2159 / 3282.


In [37]:
# Remove those flagged
df_train.loc[mask, 'mtr_reset'] = 1
df_train, df_removed, df_temp = Remove_Rows(df_train.copy(), mask, 'mtr_reset', df_removed.copy())
df_temp[col_names].head()
df_train = Calc_Neighbours(df_train.copy(), grp_cols)

Rows to be removed / out of rows requested to be removed: 2159 / 2159.
Rows where meter readings seem out of order looking backwards / Total Rows: 11070 / 4370087.
Rows where meter readings seem out of order looking forwards / Total Rows: 11070 / 4370087.
Rows where invoice spans seem to overlap based on 'months_num' / Total Rows: 1119 / 4370087.
Rows where invoice spans seem to overlap based on 'invoice_date's / Total Rows: 1149 / 4370087.


Note that this leaves behind two sets of cases:
1. Those with overlapping invoice spans but with meters in order: where removing rows degrades alignment. 
   * Sub-set of breaking rule 1 but not rule 2.
2. Those that do not have overlapping 'months_num' but with meters out of order. 
   * Breaking rule 2 but not rule 1.

Will return to these later.

#### "Reset" Meter: Rollover
There is also a "rollover" issue with the 'mtr_val_{old|new}' columns, where some readings wrap back down to 0 after exceeding a cap such as 100000 or 1000000. Two scenarios:
* Rollover occurs within a row (intra-row)
* Rollover occurs between rows (inter-row); it is assumed this requires a "gap" in dates

One complication is that the cap for a given meter is not known, so it has to be inferred from the readings around the rollover. Where possible, the aim is to add the missing amount back to fix the issue.
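
One way to infer the cap, sketched below, is to take the smallest power of ten above the largest reading seen for the meter, and to add that cap back when a reading falls below its predecessor. The helper names `rollover_cap` and `unroll` are hypothetical, and assuming exactly one rollover is a simplification.

```python
import numpy as np

def rollover_cap(max_reading):
    """Smallest power of ten strictly above the largest reading seen."""
    # The +1 makes an exact power of ten (e.g. 1000) count as needing one more digit.
    return 10 ** np.ceil(np.log10(max_reading + 1))

def unroll(mtr_val_new_prv, mtr_val_old, cap):
    """If the reading fell below the previous one, assume a single rollover."""
    return mtr_val_old + cap if mtr_val_old < mtr_val_new_prv else mtr_val_old

# Illustrative values: a four-digit meter that wrapped from 9977 to 219.
cap = rollover_cap(9977)
fixed = unroll(9977, 219, cap)
```

Here `cap` comes out as 10000.0, so the wrapped reading of 219 is treated as 10219.0.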

##### Inter-Row:

Unfortunately, it seems quite hard at times to distinguish a meter rolling over from a meter re-write. 'train_Client_1005', idx 18663, is a good example.

In [82]:
df_temp = df_train[(df_train['client_id'] == 'train_Client_1005') & (df_train['mtr_id'] == 2117699)]
col_names = ['invoice_date_prv', 'invoice_date_prv_calc', 'months_num', 'invoice_date', 'usage_n', 'mtr_val_old_prv', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'mtr_val_new_nxt']
df_temp[col_names]

Unnamed: 0,invoice_date_prv,invoice_date_prv_calc,months_num,invoice_date,usage_n,mtr_val_old_prv,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,mtr_val_new_nxt
18672,NaT,2013-01-03 12:00:00,3,2013-04-05,257.0,,,3650.0,3907.0,3907.0,4017.0
18671,2013-04-05,2013-03-30 00:00:00,4,2013-07-30,110.0,3650.0,3907.0,3907.0,4017.0,4017.0,4132.0
18658,2013-07-30,2013-08-09 00:00:00,4,2013-12-09,115.0,3907.0,4017.0,4017.0,4132.0,4132.0,4385.0
18657,2013-12-09,2013-11-30 00:00:00,4,2014-04-01,253.0,4017.0,4132.0,4132.0,4385.0,4385.0,4505.0
18675,2014-04-01,2014-04-01 00:00:00,4,2014-08-01,120.0,4132.0,4385.0,4385.0,4505.0,4505.0,4606.0
18666,2014-08-01,2014-08-02 00:00:00,4,2014-12-02,101.0,4385.0,4505.0,4505.0,4606.0,4606.0,6852.0
18665,2014-12-02,2014-11-29 00:00:00,4,2015-03-31,2246.0,4505.0,4606.0,4606.0,6852.0,4995.0,5028.0
18663,2015-03-31,2015-09-26 00:00:00,2,2015-11-26,33.0,4606.0,6852.0,4995.0,5028.0,5028.0,5271.0
18678,2015-11-26,2015-11-27 00:00:00,4,2016-03-28,243.0,4995.0,5028.0,5028.0,5271.0,5271.0,5388.0
18668,2016-03-28,2016-03-28 00:00:00,4,2016-07-28,117.0,5028.0,5271.0,5271.0,5388.0,5538.0,5704.0


The meter value goes from ~7k to ~5k, with a 6-month gap between invoice periods. Projecting the previous invoice's usage rate into that gap would take the reading to ~10k+, which may indicate a rollover.
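
The projection can be worked through with the figures from the table above for idx 18663. This is a sketch only: the 6-month gap is the approximate value quoted in the text, and the four-digit cap of 10000 is an assumption based on the largest readings shown.

```python
# Previous invoice (idx 18665): reading went from 4606 to 6852 over 4 months.
usage_prev = 6852 - 4606          # 2246 units
months_prev = 4
gap_months = 6                    # approximate uninvoiced gap (assumed)

rate = usage_prev / months_prev   # usage per month on the previous invoice
projected_usage = rate * gap_months
projected_reading = 6852 + projected_usage

cap = 10_000                      # assumed four-digit meter
# If the projection passes the cap, wrap it around once.
wrapped = projected_reading - cap if projected_reading >= cap else projected_reading

actual_old = 4995                 # 'mtr_val_old' on idx 18663
difference = actual_old - wrapped
```

The projected reading passes the assumed cap, which is consistent with a rollover, but the wrapped estimate still sits several thousand units away from the actual 'mtr_val_old', which is part of why this case is hard to classify.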

In [None]:
#df_train_save, df_removed_save = df_train.copy(), df_removed.copy()
df_train, df_removed = df_train_save.copy(), df_removed_save.copy()

In [None]:
# Infer a rollover cap per group: assume the digit count is constant and that a rollover cannot occur below the largest reading seen.
df_train['mtr_max_digit'] = df_train[['mtr_val_old', 'mtr_val_new']].max(axis=1) # Largest reading per row
df_train['mtr_max_digit'] = df_train.groupby(grp_cols, observed=True)['mtr_max_digit'].transform('max') # Largest reading per group
df_train['mtr_max_digit'] = 10 ** np.ceil(np.log10(df_train['mtr_max_digit'] + 1)) # Smallest power of ten above that reading

# Looking at rollover across rows, first find those with gaps (3+ M gap)
# Ignoring those where usage_n does not seem to match mtr_vals
df_train['months_gap_calc'] = df_train['months_num_calc'] - df_train['months_num'] # Difference in dates in months
mask = (df_train['months_gap_calc'] > 2) & (df_train['usage_flag'] == 0)

# Looking if current usage would have caused rollover going backwards:
df_train['temp_flag'] = 0
df_train.loc[mask, 'temp_flag'] = (df_train.loc[mask, 'usage_n'].div(df_train.loc[mask, 'months_num']) * df_train.loc[mask, 'months_gap_calc']).round(0) # Projected usage
mask_2 = ((df_train['mtr_val_old'] < df_train['temp_flag']) & # Had enough (gap * usage) to have caused rollover
          (df_train['mtr_flag_bkd']) & # Out of order
          (df_train['mtr_val_new_prv'] + df_train['temp_flag'] > df_train['mtr_max_digit']) & # Would have caused highest digit count seen
          ???) # Gets within ballpark of actual values ~2k

# Looking if previous usage would have caused rollover going forwards:
df_train['temp_flag'] = 0
df_train.loc[mask, 'temp_flag'] = ((df_train.loc[mask, 'mtr_val_new_prv'] - df_train.loc[mask, 'mtr_val_old_prv']).div(df_train.loc[mask, 'months_num_prv']) * df_train.loc[mask, 'months_gap_calc']).round(0) # Projected usage
mask = ((df_train['mtr_flag_bkd']) & # Out of order
        (df_train['mtr_val_new_prv'] + df_train['temp_flag'] > df_train['mtr_max_digit']) & # Would have caused highest digit count seen (and so also rollover)
         ???) # Gets within ballpark of actual values ~2k

print(f"Number of Rows suspected of soft resetting between rows projecting backwards / those projecting forwards: {sum(mask_2)} / {sum(mask)}.")
df_train['mtr_old_rollover'] = (mask | mask_2).astype(int)

df_temp = df_train[mask | mask_2]
df_temp[['client_id', 'invoice_date', 'mtr_coef', 'usage_n', 'mtr_val_new_prv', 'mtr_val_old', 'mtr_val_new', 'mtr_val_old_nxt', 'months_num', 'mtr_val_new_calc']]

Number of Rows suspected of soft resetting between rows projecting backwards / those projecting forwards: 158 / 409.


Unnamed: 0,client_id,invoice_date,mtr_coef,usage_n,mtr_val_new_prv,mtr_val_old,mtr_val_new,mtr_val_old_nxt,months_num,mtr_val_new_calc
1643,train_Client_100044,2012-12-11,1.0,492.0,9977.0,219.0,711.0,711.0,4,711.0
9094,train_Client_100257,2010-01-26,1.0,2171.0,8992.0,2860.0,5031.0,7420.0,4,5031.0
14813,train_Client_100393,2016-04-13,1.0,322.0,92621.0,86750.0,87072.0,91022.0,2,87072.0
18663,train_Client_1005,2015-11-26,1.0,33.0,6852.0,4995.0,5028.0,5028.0,2,5028.0
25836,train_Client_100717,2011-10-13,1.0,556.0,9537.0,1429.0,1985.0,1985.0,4,1985.0
...,...,...,...,...,...,...,...,...,...,...
4417785,train_Client_98405,2018-12-19,1.0,50.0,9554.0,2632.0,2682.0,,2,2682.0
4443671,train_Client_99083,2008-01-02,1.0,12387.0,90323.0,4689.0,17076.0,17076.0,4,17076.0
4452373,train_Client_99339,2019-04-19,1.0,76.0,97632.0,4116.0,4192.0,,2,4192.0
4466803,train_Client_99736,2008-10-03,1.0,601.0,612.0,103.0,704.0,106.0,2,704.0


In [76]:
10 ** np.ceil(np.log10(999+1))

np.float64(1000.0)

In [75]:
10 ** np.ceil(np.log10(999))

np.float64(1000.0)

In [51]:
np.log10(49)

np.float64(1.6901960800285136)