## Exploring Data

Import the packages needed to explore and modify the data.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# Read the CSV files
df_client_test = pd.read_csv('./client_test.csv', on_bad_lines='skip')
df_client_train = pd.read_csv('./client_train.csv', on_bad_lines='skip')
df_invoice_test = pd.read_csv('./invoice_test.csv', on_bad_lines='skip')
# low_memory=False is set because this large file contains unexpected values, which otherwise prompts a mixed-dtype warning
df_invoice_train = pd.read_csv('./invoice_train.csv', on_bad_lines='skip', low_memory=False)
df_SampleSubmission = pd.read_csv('./SampleSubmission (2).csv', on_bad_lines='skip')

In [3]:
df_SampleSubmission.head()

Unnamed: 0,client_id,target
0,test_Client_0,0.957281
1,test_Client_1,0.996425
2,test_Client_10,0.612359
3,test_Client_100,0.776933
4,test_Client_1000,0.571046


Conclusion: df_SampleSubmission is not relevant for this project; it only shows the expected submission format for evaluating performance on Kaggle.

In [4]:
df_client_test.head()

Unnamed: 0,disrict,client_id,client_catg,region,creation_date
0,62,test_Client_0,11,307,28/05/2002
1,69,test_Client_1,11,103,06/08/2009
2,62,test_Client_10,11,310,07/04/2004
3,60,test_Client_100,11,101,08/10/1992
4,62,test_Client_1000,11,301,21/07/1977


In [5]:
df_invoice_test.head()

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type
0,test_Client_0,2018-03-16,11,651208,0,203,8,1,755,0,0,0,19145,19900,8,ELEC
1,test_Client_0,2014-03-21,11,651208,0,203,8,1,1067,0,0,0,13725,14792,8,ELEC
2,test_Client_0,2014-07-17,11,651208,0,203,8,1,0,0,0,0,14792,14792,4,ELEC
3,test_Client_0,2015-07-13,11,651208,0,203,9,1,410,0,0,0,16122,16532,4,ELEC
4,test_Client_0,2016-07-19,11,651208,0,203,9,1,412,0,0,0,17471,17883,4,ELEC


In [6]:
print(f"Number of rows in client train vs invoice train: {len(df_client_train)} vs {len(df_invoice_train)}")
print(f"Number of unique client_id in client train vs invoice train: {df_client_train['client_id'].nunique()} vs {df_invoice_train['client_id'].nunique()}")
print(f"Number of rows in client test vs invoice test: {len(df_client_test)} vs {len(df_invoice_test)}")
print(f"Number of unique client_id in client test vs invoice test: {df_client_test['client_id'].nunique()} vs {df_invoice_test['client_id'].nunique()}", end="")

Number of rows in client train vs invoice train: 135493 vs 4476749
Number of unique client_id in client train vs invoice train: 135493 vs 135493
Number of rows in client test vs invoice test: 58069 vs 1939730
Number of unique client_id in client test vs invoice test: 58069 vs 58069

Conclusion: The client and invoice tables were split for data normalisation (one row per client, many invoice rows per client). That separation is not needed here, so the tables will be merged.
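As a minimal sketch of the many-to-one merge, the toy tables below stand in for the invoice and client data. The frame names and values are made up for illustration; `validate="m:1"` asks pandas to raise an error if `client_id` is duplicated on the client side.

```python
import pandas as pd

# Toy stand-ins for the client and invoice tables (values are made up).
clients = pd.DataFrame({
    "client_id": ["c0", "c1"],
    "region": [101, 307],
})
invoices = pd.DataFrame({
    "client_id": ["c0", "c0", "c1"],
    "usage": [82, 1200, 755],
})

# validate="m:1" raises MergeError if client_id is duplicated on the client side.
merged = invoices.merge(clients, on="client_id", how="left", validate="m:1")
print(merged)
```

Each invoice row keeps its position and gains the client columns, so the row count of the merged frame should match the invoice table.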

In [7]:
df_train = df_invoice_train.join(df_client_train.set_index('client_id'), on='client_id', validate='m:1').copy()
df_train

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,...,consommation_level_4,old_index,new_index,months_number,counter_type,disrict,client_catg,region,creation_date,target
0,train_Client_0,2014-03-24,11,1335667,0,203,8,1,82,0,...,0,14302,14384,4,ELEC,60,11,101,31/12/1994,0.0
1,train_Client_0,2013-03-29,11,1335667,0,203,6,1,1200,184,...,0,12294,13678,4,ELEC,60,11,101,31/12/1994,0.0
2,train_Client_0,2015-03-23,11,1335667,0,203,8,1,123,0,...,0,14624,14747,4,ELEC,60,11,101,31/12/1994,0.0
3,train_Client_0,2015-07-13,11,1335667,0,207,8,1,102,0,...,0,14747,14849,4,ELEC,60,11,101,31/12/1994,0.0
4,train_Client_0,2016-11-17,11,1335667,0,207,9,1,572,0,...,0,15066,15638,12,ELEC,60,11,101,31/12/1994,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4476744,train_Client_99998,2005-08-19,10,1253571,0,202,9,1,400,135,...,0,3197,3732,8,ELEC,60,11,101,22/12/1993,0.0
4476745,train_Client_99998,2005-12-19,10,1253571,0,202,6,1,200,6,...,0,3732,3938,4,ELEC,60,11,101,22/12/1993,0.0
4476746,train_Client_99999,1996-09-25,11,560948,0,203,6,1,259,0,...,0,13884,14143,4,ELEC,60,11,101,18/02/1986,0.0
4476747,train_Client_99999,1996-05-28,11,560948,0,203,6,1,603,0,...,0,13281,13884,4,ELEC,60,11,101,18/02/1986,0.0


In [8]:
df_test = df_invoice_test.join(df_client_test.set_index('client_id'), on='client_id', validate='m:1').copy()
df_test

Unnamed: 0,client_id,invoice_date,tarif_type,counter_number,counter_statue,counter_code,reading_remarque,counter_coefficient,consommation_level_1,consommation_level_2,consommation_level_3,consommation_level_4,old_index,new_index,months_number,counter_type,disrict,client_catg,region,creation_date
0,test_Client_0,2018-03-16,11,651208,0,203,8,1,755,0,0,0,19145,19900,8,ELEC,62,11,307,28/05/2002
1,test_Client_0,2014-03-21,11,651208,0,203,8,1,1067,0,0,0,13725,14792,8,ELEC,62,11,307,28/05/2002
2,test_Client_0,2014-07-17,11,651208,0,203,8,1,0,0,0,0,14792,14792,4,ELEC,62,11,307,28/05/2002
3,test_Client_0,2015-07-13,11,651208,0,203,9,1,410,0,0,0,16122,16532,4,ELEC,62,11,307,28/05/2002
4,test_Client_0,2016-07-19,11,651208,0,203,9,1,412,0,0,0,17471,17883,4,ELEC,62,11,307,28/05/2002
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1939725,test_Client_9999,2010-03-11,11,869269,0,203,6,1,248,0,0,0,21114,21362,4,ELEC,69,11,104,14/03/1990
1939726,test_Client_9999,2011-03-15,11,869269,0,203,6,1,260,0,0,0,21362,21622,4,ELEC,69,11,104,14/03/1990
1939727,test_Client_9999,2012-12-03,11,869269,0,203,6,1,312,0,0,0,22272,22584,4,ELEC,69,11,104,14/03/1990
1939728,test_Client_9999,2009-07-01,11,869269,0,203,6,1,236,0,0,0,19909,20145,4,ELEC,69,11,104,14/03/1990


In [9]:
del df_invoice_train, df_invoice_test, df_client_train, df_client_test, df_SampleSubmission

I would like to condense the data a bit. The source provides some comments on the meaning of these columns. They are included here verbatim.

"client_train.csv":

"disrict: District where the client is"

"client_id: Unique id for client"

"client_catg: Category client belongs to"

"region: Area where the client is"

"creation_date: Date client joined"

"target: fraud:1, not fraud: 0"

"invoice_train.csv":

"client_id: Unique id for client"

"invoice_date: Date of the invoice"

"tarif_type: Type of tax"

"counter_number: number"

"counter_statue: akes up to 5 values such as working fine, not working, on hold statue, ect"

"counter_code: code"

"reading_remarque: notes that the STEG agent takes during his visit to the cleint (e.g.: if the counter shows something wrong, the"

"counter_coefficient: An additional coefficient to be added when standard consumption is exceeded"

"consommation_level_1: Consumption_level_1"

"consommation_level_2: Consumption_level_2"

"consommation_level_3: Consumption_level_3"

"consommation_level_4: Consumption_level_4"

"old_index: Old index"

"new_index: New index"

"months_number: Month number"

"counter_type: Type of counter"

Going to rename the columns slightly.

In [10]:
rename_dict = {'client_id' : 'client_id', 
               'invoice_date' : 'invoice_date', 
               'tarif_type' : 'mtr_tariff', 
               'counter_number' : 'mtr_id',
               'counter_statue' : 'mtr_status', 
               'counter_code' : 'mtr_code', 
               'reading_remarque' : 'mtr_notes',
               'counter_coefficient' : 'mtr_coef', 
               'consommation_level_1' : 'usage_1', 
               'consommation_level_2' : 'usage_2',
               'consommation_level_3' : 'usage_3', 
               'consommation_level_4' : 'usage_4', 
               'old_index' : 'mtr_val_old',
               'new_index' : 'mtr_val_new', 
               'months_number': 'months_num', 
               'counter_type' : 'mtr_type', 
               'disrict' : 'district', 
               'client_catg' : 'client_type',
               'region' : 'region', 
               'creation_date' : 'start_date', 
               'target' : 'fraud'}

df_train.rename(columns=rename_dict, inplace=True)

In [11]:
df_train.dtypes

client_id        object
invoice_date     object
mtr_tariff        int64
mtr_id            int64
mtr_status       object
mtr_code          int64
mtr_notes         int64
mtr_coef          int64
usage_1           int64
usage_2           int64
usage_3           int64
usage_4           int64
mtr_val_old       int64
mtr_val_new       int64
months_num        int64
mtr_type         object
district          int64
client_type       int64
region            int64
start_date       object
fraud           float64
dtype: object

Converting data types where appropriate:

In [12]:
# invoice_date: object -> date [YYYY-MM-DD] -> [YYYY-MM-DD]
df_train['invoice_date'] = pd.to_datetime(df_train['invoice_date'])
# start_date: object -> date [DD/MM/YYYY] -> [YYYY-MM-DD]
df_train['start_date'] = pd.to_datetime(df_train['start_date'], dayfirst=True)

In [13]:
# Looking at unique values for the columns (but skipping the dates)
col_names = df_train.columns
col_names = [col_name for col_name in col_names if 'date' not in col_name]
# Anything with a high ratio means a lot of repeated values found
for col_name in col_names:
    print(f"Column name: {col_name}, Unique Values: {df_train[col_name].unique()}")
    print(f"Column name: {col_name}, Unique Values Ratio: {len(df_train) / df_train[col_name].nunique()}")

Column name: client_id, Unique Values: ['train_Client_0' 'train_Client_1' 'train_Client_10' ...
 'train_Client_99997' 'train_Client_99998' 'train_Client_99999']
Column name: client_id, Unique Values Ratio: 33.04044489383215
Column name: mtr_tariff, Unique Values: [11 40 15 10 12 14 13 45 29  9 30  8 21 42 27 18 24]
Column name: mtr_tariff, Unique Values Ratio: 263338.17647058825
Column name: mtr_id, Unique Values: [1335667  678902  572765 ... 4811719  262195  560948]
Column name: mtr_id, Unique Values Ratio: 22.173869326821634
Column name: mtr_status, Unique Values: ['0' '1' '5' '4' '3' '2' '769' 'A' '618' '269375' '46' '420']
Column name: mtr_status, Unique Values Ratio: 373062.4166666667
Column name: mtr_code, Unique Values: [203 207 413   5 467 202 420 410  10 483  25 433 407 204 214 442 453 506
 450 403 333 201 102 305 210 101 532  40 310 565 600 307 303 222  65   0
 227 325  16 317 367   1]
Column name: mtr_code, Unique Values Ratio: 106589.26190476191
Column name: mtr_notes, Uniq

There are some issues with the data, which are discussed below. For now, only the columns that are not affected by those issues are converted.

In [14]:
# The following columns:
# mtr_tariff, mtr_status, mtr_code, mtr_notes, mtr_coef, mtr_type, district, client_type, region
# all seem very repetitive and are going to be treated as categories.
col_names = ['mtr_type', 'district', 'client_type', 'region'] # Those excluded will be dealt with later
df_train[col_names] = df_train[col_names].astype("category")

In [15]:
col_names = ['client_id', 'mtr_tariff', 'mtr_id', 'mtr_status', 'mtr_code', 'mtr_notes', 'mtr_coef', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'mtr_val_old', 'mtr_val_new', 'months_num', 'fraud']

Based on data mining, some relationships have been deduced:
* usage_1 to usage_4 are connected in a complicated way.
* It seems that the usage_X values are affected by either mtr_tariff or mtr_code.
* Two aspects can be affected: whether a usage_X is capped and, if so, what the cap is.
* If a level is capped and the cap is exceeded, the usage "spills" over into the next level.
* The usage_3 cap equals the usage_1 cap, the usage_2 cap equals half the usage_1 cap, and usage_4 is uncapped. The levels are "filled" sequentially.
* It does not seem to matter which usage_X a unit is counted in: the levels are not scaled differently.
* However, mtr_coef does scale all usage_X values.
* mtr_val_old and mtr_val_new are meant to track the meter reading before and after the usage_X is applied.
* Importantly, the data is partially corrupted. Therefore, the order in which things are presented follows the order needed to reverse the corruption rather than the order that is easiest to follow.
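The sequential "filling" described in the list above can be sketched as follows. This is only an illustration of the deduced relationship, not a confirmed billing rule: the function name `split_usage` and the cap of 200 are assumptions made for the example.

```python
def split_usage(total, cap_1):
    """Split a total into four levels, filling each level up to its cap in turn."""
    caps = [cap_1, cap_1 / 2, cap_1]  # levels 1-3; level 4 is uncapped
    levels = []
    remaining = total
    for cap in caps:
        filled = min(remaining, cap)
        levels.append(filled)
        remaining -= filled
    levels.append(remaining)  # whatever is left spills into level 4
    return levels


# A total of 2310 with an assumed level-1 cap of 200
print(split_usage(2310, 200))
# A total below the first cap stays entirely in level 1
print(split_usage(150, 200))
```

A row whose levels do not follow this pattern (for example, level 3 in use while level 2 is not half of level 1) is a candidate for corruption, which is what the later rules try to exploit.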

Rule 1: 'mtr_code' != 0
Otherwise, this seems to be a sign of corruption.

In [16]:
df_train['mtr_status'].value_counts()

mtr_status
0         4379008
1           74036
5           20639
4            2729
3             258
2              32
46             14
A              13
618            12
769             6
269375          1
420             1
Name: count, dtype: int64

In [17]:
df_train['mtr_code'].value_counts()

mtr_code
203    1516836
5      1352035
207     555628
413     378917
202     343251
420      98273
410      69080
433      34447
10       27744
442      17050
25       14934
407      13768
204      12427
453       8290
201       7672
467       7302
506       3389
483       2830
214       2643
40        2482
532       1982
565       1082
403       1070
450       1059
600        807
210        789
102        313
227        178
101         74
65          74
16          73
310         51
307         50
305         44
222         42
0           33
317         16
303          5
333          4
367          3
325          1
1            1
Name: count, dtype: int64

In [18]:
df_train['mtr_notes'].value_counts()

mtr_notes
6      2230939
9      1416992
8       828123
7          661
203         15
413         12
207          6
5            1
Name: count, dtype: int64

In [19]:
df_train['mtr_tariff'].value_counts()

mtr_tariff
11    2679872
40    1379755
10     276210
15      72422
45      17552
13      11656
14      11611
12      11345
29      10090
9        6039
21        104
8          43
30         35
24          9
18          4
27          1
42          1
Name: count, dtype: int64

In [20]:
df_train['mtr_coef'].value_counts()

mtr_coef
1     4475102
2         886
3         321
40        197
30        137
0          46
6          30
4          12
10          6
9           3
20          3
50          2
33          1
5           1
11          1
8           1
Name: count, dtype: int64

Looking at outlier values for 'mtr_status'

In [21]:
mask = df_train['mtr_status'].isin(['46', '618', '769', '269375', '420']) # Manual check showed 'A' seems acceptable
print(f'number of bad mtr_status: {sum(mask)}.')
df_train['adj_flag'] = mask
mask = df_train.groupby(['client_id', 'mtr_tariff', 'mtr_id'])['adj_flag'].transform('any')
print(f'number of affected rows if grouped: {sum(mask)}.')
df_train[['invoice_date']+col_names][mask]

number of bad mtr_status: 34.
number of affected rows if grouped: 34.


Unnamed: 0,invoice_date,client_id,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,fraud
1178200,2010-11-19,train_Client_13203,11,170,769,0,207,6,1,479,0,0,0,717,1196,0.0
1178205,2011-11-21,train_Client_13203,11,170,769,0,207,6,1,642,0,0,0,2086,2728,0.0
1178207,2011-07-21,train_Client_13203,11,170,769,0,207,6,1,453,0,0,0,1633,2086,0.0
1178209,2011-03-29,train_Client_13203,11,170,769,0,207,6,1,437,0,0,0,1196,1633,0.0
1178211,2010-07-19,train_Client_13203,11,170,769,0,207,6,1,385,0,0,0,332,717,0.0
1178214,2010-03-24,train_Client_13203,11,170,769,0,207,6,1,332,0,0,0,0,332,0.0
2556034,2017-08-16,train_Client_47780,11,752,618,0,413,6,1,0,0,0,0,2,2,0.0
2556035,2017-12-19,train_Client_47780,11,752,618,0,413,9,1,1,0,0,0,2,3,0.0
2556036,2015-12-17,train_Client_47780,11,752,618,0,413,9,1,2,0,0,0,0,2,0.0
2556037,2016-04-18,train_Client_47780,11,752,618,0,413,6,1,0,0,0,0,2,2,0.0


For these, it is assumed that the columns have been shifted by mistake. One key sign here is usage_1 == 1 together with mtr_coef being 6, 8, or 9.
It is not clear why this happened. The changes will likely introduce some errors. However, given the time constraint, the goal is to reduce the net quantity of errors.
In particular, it is not clear what 'mtr_id' and/or 'mtr_status' are meant to represent here, or which of them might be wrong. 'mtr_id' will be kept and the 'mtr_status' value will be discarded.
Not yet sure what to do with months_num, so a placeholder value of -1 is used.
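The shift-left repair can be sketched on a toy frame. The column names `a` to `e` and the marker value are hypothetical; the point is that `.values` strips the column labels so the assignment is positional rather than label-aligned.

```python
import pandas as pd

# Toy frame: row 1 has a stray value in "b", pushing the real values one column right.
df = pd.DataFrame({
    "a": [1, 1],
    "b": [0, 769],
    "c": [207, 0],
    "d": [6, 207],
    "e": [4, 6],
})
bad = df["b"] == 769  # hypothetical marker for a shifted row

# Overwrite columns b..d with the values of columns c..e for the bad rows only.
# .values drops the column labels so the assignment is positional, not label-aligned.
df.loc[bad, ["b", "c", "d"]] = df.loc[bad, ["c", "d", "e"]].values
df.loc[bad, "e"] = -1  # the true last value is unknown, so mark it with a placeholder
print(df)
```

Without `.values`, pandas would align on column names and the values would not move.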

In [22]:
df_train.loc[mask, df_train.columns[4:14]] = df_train.loc[mask, df_train.columns[5:15]].values
df_train.loc[mask, df_train.columns[14]] = -1
df_train[col_names][mask]

Unnamed: 0,client_id,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,fraud
1178200,train_Client_13203,11,170,0,207,6,1,479,0,0,0,717,1196,-1,0.0
1178205,train_Client_13203,11,170,0,207,6,1,642,0,0,0,2086,2728,-1,0.0
1178207,train_Client_13203,11,170,0,207,6,1,453,0,0,0,1633,2086,-1,0.0
1178209,train_Client_13203,11,170,0,207,6,1,437,0,0,0,1196,1633,-1,0.0
1178211,train_Client_13203,11,170,0,207,6,1,385,0,0,0,332,717,-1,0.0
1178214,train_Client_13203,11,170,0,207,6,1,332,0,0,0,0,332,-1,0.0
2556034,train_Client_47780,11,752,0,413,6,1,0,0,0,0,2,2,-1,0.0
2556035,train_Client_47780,11,752,0,413,9,1,1,0,0,0,2,3,-1,0.0
2556036,train_Client_47780,11,752,0,413,9,1,2,0,0,0,0,2,-1,0.0
2556037,train_Client_47780,11,752,0,413,6,1,0,0,0,0,2,2,-1,0.0


Looking at outlier values for 'mtr_coef'

'mtr_coef' of 0 is rare, and any energy usage in that row does not transfer onwards. These rows will be deleted.

In [23]:
mask = df_train['mtr_coef'] == 0 # Manual check showed these do not contribute to energy usage over time. They reset each row.
print(f'number of mtr_coef == 0: {sum(mask)}.')
df_removed = df_train[mask].copy() # Record of deleted rows
df_train = df_train[~mask]

number of mtr_coef == 0: 46.
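The filtered frame above is assigned back without `.copy()`, which is a likely cause of the `SettingWithCopyWarning` messages that appear in later cells. A minimal sketch on toy data, assuming the same boolean-mask pattern (frame and column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({"coef": [1, 0, 2], "usage": [10, 20, 30]})
mask = df["coef"] == 0

removed = df[mask].copy()  # record of the dropped rows
kept = df[~mask].copy()    # explicit copy, so later assignments target this frame directly

kept["usage_scaled"] = kept["usage"] / kept["coef"]
print(kept)
```

Taking an explicit copy costs some memory but makes it unambiguous that later column assignments modify the filtered frame.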


In [24]:
df_train['months_num'].value_counts()

months_num
4         3680456
8          278770
2          270886
1          113265
12          54328
           ...   
378671          1
374707          1
454440          1
370510          1
38329           1
Name: count, Length: 1350, dtype: int64

'usage_n' is the sum of the usage_X columns and ought to represent the total usage of that row. This usage is meant to be scaled by 'mtr_coef' first (via division), and the result ought to equal the difference between 'mtr_val_old' and 'mtr_val_new'. However, it often does not, due to various issues. This is quite tricky and I don't have time to "solve" it fully.

In [25]:
# Getting total usage (scaled by coef):
df_train.loc[:, 'usage_n'] = (df_train[['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1) / df_train['mtr_coef'])
df_train.loc[:, 'mtr_val_new_calc'] = df_train.loc[:, 'mtr_val_old'] + df_train.loc[:, 'usage_n'] # This is what is expected if all is working
mask = df_train['mtr_val_new'] != df_train['mtr_val_new_calc']
df_temp = df_train[mask]
df_temp

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train.loc[:, 'usage_n'] = (df_train[['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1) / df_train['mtr_coef'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train.loc[:, 'mtr_val_new_calc'] = df_train.loc[:, 'mtr_val_old'] + df_train.loc[:, 'usage_n'] # This is what is expected if all is working


Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,months_num,mtr_type,district,client_type,region,start_date,fraud,adj_flag,usage_n,mtr_val_new_calc
1120,train_Client_100024,2014-06-20,11,482429,0,207,8,1,800,400,...,4,ELEC,60,11,101,1985-03-22,0.0,False,1518.0,100778.0
1215,train_Client_100025,2012-02-17,11,1028965,0,203,6,1,372,0,...,4,ELEC,63,11,311,2005-12-15,0.0,False,372.0,7324.0
1865,train_Client_100050,2012-01-25,11,112753,0,413,9,1,2400,543,...,8,ELEC,63,11,311,1994-12-31,0.0,False,2943.0,10652.0
2632,train_Client_100075,2008-11-27,11,502690,0,410,6,1,116,0,...,14,ELEC,60,11,101,2007-09-13,0.0,False,116.0,116.0
2818,train_Client_10008,2015-05-01,11,842528,0,413,9,1,800,400,...,4,ELEC,63,11,101,1987-09-14,0.0,False,2363.0,101849.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4475524,train_Client_99964,2010-08-11,15,403585,0,202,9,1,215,0,...,2,ELEC,60,11,101,1978-10-04,0.0,False,215.0,4550.0
4475991,train_Client_99978,2011-08-12,40,7006449,0,5,6,1,395,0,...,6,GAZ,69,11,104,1995-10-30,0.0,False,395.0,2876.0
4476318,train_Client_99985,2006-10-18,11,974695,0,532,6,3,1200,48618,...,4,ELEC,60,51,101,1994-12-31,0.0,False,16606.0,116514.0
4476420,train_Client_99988,2012-05-01,40,70623,0,5,6,1,61,0,...,4,GAZ,63,11,101,2002-12-10,0.0,False,61.0,1611.0


There are quite a lot of cases to consider, and the rules below are the result of a lot of trial and error:
* Rule 1: If 'months_num' > 240, Then Flag
* Rule 2a: If columns 'usage_2' -> 'mtr_val_new' are all 0, Then Ignore
* Rule 2b: Else If 'usage_3' > 0, and 'usage_2' is not half 'usage_1', Then Flag
* Rule 2c: Else If 'usage_1' is 1-9, and 'usage_2' is not 0, Then Flag
* Rule 2d: Else If 'usage_1' is 1-9, and 'mtr_val_new' equals 'months_num', and 'mtr_val_new' is not 'mtr_val_new_calc', Then Flag
* Rule 3: If Flagged and 'mtr_val_old' is not 0, and 'usage_4' is 0, Then Ignore (Unflag)
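One way to make these rules easier to audit is to build each one as a named boolean mask and combine them at the end. The sketch below follows the rule list as written and uses made-up toy rows; it is not guaranteed to match the notebook's own mask in every detail.

```python
import pandas as pd

# Toy rows (values made up) with the columns the rules refer to.
df = pd.DataFrame({
    "usage_1": [5, 300, 0],
    "usage_2": [1200, 0, 0],
    "usage_3": [0, 0, 0],
    "usage_4": [40, 0, 0],
    "mtr_val_old": [0, 100, 0],
    "mtr_val_new": [1245, 400, 0],
    "mtr_val_new_calc": [1245, 400, 0],
    "months_num": [4, 4, 300],
})

small_usage_1 = df["usage_1"].between(1, 9)
zero_cols = ["usage_2", "usage_3", "usage_4", "mtr_val_old", "mtr_val_new"]
all_zero = df[zero_cols].eq(0).all(axis=1)  # Rule 2a

rule_1 = df["months_num"] > 240
rule_2b = (df["usage_3"] > 0) & (df["usage_1"] / 2 != df["usage_2"])
rule_2c = small_usage_1 & (df["usage_2"] != 0)
rule_2d = (small_usage_1
           & (df["mtr_val_new"] == df["months_num"])
           & (df["mtr_val_new"] != df["mtr_val_new_calc"]))

flag = rule_1 | (~all_zero & (rule_2b | rule_2c | rule_2d))
# Rule 3: unflag rows that have a non-zero old reading and no level-4 usage.
flag &= ~((df["mtr_val_old"] != 0) & (df["usage_4"] == 0))
print(flag.tolist())
```

Naming the pieces also makes it possible to count how many rows each rule catches on its own, for example with `rule_2c.sum()`.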

In [26]:
# Rules 1 and 2
mask = (((df_train['usage_2'] + df_train['usage_3'] + df_train['usage_4'] + df_train['mtr_val_old'] + df_train['mtr_val_new'] != 0) &
        (((df_train['usage_3'] > 0) & ((df_train['usage_1'] / 2) != df_train['usage_2'])) |
         ((df_train['usage_1'] > 0) & (df_train['usage_1'] < 10) & (df_train['usage_2'] > 0))) |
         ((df_train['usage_1'] > 0) & (df_train['usage_1'] < 10) & (df_train['mtr_val_new'] == df_train['months_num']) & (df_train['mtr_val_new'] != df_train['mtr_val_new_calc']))) |
         (df_train['months_num'] > 240))
df_train.loc[:, 'temp_flag'] = mask
# Rule 3
df_train.loc[(df_train['temp_flag'] == True) & (df_train['mtr_val_old'] != 0) & (df_train['usage_4'] == 0), 'temp_flag'] = False

mask = df_train['temp_flag'] == True
df_temp = df_train[mask]
df_temp

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train.loc[:, 'temp_flag'] = mask


Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,mtr_type,district,client_type,region,start_date,fraud,adj_flag,usage_n,mtr_val_new_calc,temp_flag
20211,train_Client_100551,2010-08-30,11,1099471,0,467,6,1,5,1200,...,ELEC,60,11,101,2009-02-20,0.0,False,11949.0,11949.0,True
20212,train_Client_100551,2010-06-05,11,1099471,0,467,9,1,5,1200,...,ELEC,60,11,101,2009-02-20,0.0,False,11771.0,11771.0,True
20213,train_Client_100551,2010-08-01,11,1099471,0,467,9,1,5,1200,...,ELEC,60,11,101,2009-02-20,0.0,False,9995.0,9995.0,True
20214,train_Client_100551,2009-03-09,11,1099471,0,467,9,1,5,1200,...,ELEC,60,11,101,2009-02-20,0.0,False,4229.0,4229.0,True
20215,train_Client_100551,2011-03-14,11,1099471,5,467,6,1,5,0,...,ELEC,60,11,101,2009-02-20,0.0,False,5.0,5.0,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4457223,train_Client_99465,2015-03-02,11,99072847,0,506,9,1,5,200,...,ELEC,60,51,101,2000-03-22,0.0,False,505.0,2315.0,True
4457224,train_Client_99465,2011-04-07,11,99072847,0,506,6,1,5,300,...,ELEC,60,51,101,2000-03-22,0.0,False,3938.0,3938.0,True
4457225,train_Client_99465,2011-02-08,11,99072847,0,506,6,1,5,300,...,ELEC,60,51,101,2000-03-22,0.0,False,4632.0,4632.0,True
4457226,train_Client_99465,2015-06-01,11,99072847,0,506,9,1,5,200,...,ELEC,60,51,101,2000-03-22,0.0,False,505.0,1826.0,True


For these, it is assumed that 'mtr_coef' had a decimal place that caused an offset. For example, "1,5" caused the 5 to fall into 'usage_1', and so on. 'months_num' was then overwritten and lost.
'mtr_coef' will be reconstructed for these rows, and the columns shifted back. Not yet sure what to do with months_num, so a placeholder value of -1 is used.
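The reconstruction can be illustrated on a single toy row. The values are made up, and the sketch assumes the stray digit in 'usage_1' is exactly the first decimal place of the coefficient.

```python
import pandas as pd

# Toy row (values made up): a coefficient of "1,5" was read as mtr_coef=1 and usage_1=5,
# pushing the real usage values one column to the right.
df = pd.DataFrame({
    "mtr_coef": [1],
    "usage_1": [5],
    "usage_2": [1200],
    "usage_3": [300],
})

df["mtr_coef"] = df["mtr_coef"].astype("float64")
# The stray digit is taken as the first decimal place of the coefficient: 1 + 5/10 = 1.5.
df["mtr_coef"] = df["mtr_coef"] + (df["usage_1"] / 10).round(1)
# Shift the usage values back one column to the left.
df[["usage_1", "usage_2"]] = df[["usage_2", "usage_3"]].values
print(df)
```

This only recovers coefficients with a single decimal digit; a value such as "1,25" would not be reconstructed correctly under this assumption.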

In [27]:
df_train.loc[mask, 'adj_flag'] = True
df_train['mtr_coef'] = df_train['mtr_coef'].astype('float64')
df_train.loc[mask, 'mtr_coef'] = df_train.loc[mask, 'mtr_coef'].values + (df_train.loc[mask, 'usage_1'].values / 10).round(1)
df_train.loc[mask, df_train.columns[8:14]] = df_train.loc[mask, df_train.columns[9:15]].values
df_train.loc[mask, df_train.columns[14]] = -1
df_train[col_names][mask]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['mtr_coef'] = df_train['mtr_coef'].astype('float64')


Unnamed: 0,client_id,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,usage_3,usage_4,mtr_val_old,mtr_val_new,months_num,fraud
20211,train_Client_100551,11,1099471,0,467,6,1.5,1200,10744,0,0,17815,25778,-1,0.0
20212,train_Client_100551,11,1099471,0,467,9,1.5,1200,10566,0,0,9971,17815,-1,0.0
20213,train_Client_100551,11,1099471,0,467,9,1.5,1200,8790,0,0,3311,9971,-1,0.0
20214,train_Client_100551,11,1099471,0,467,9,1.5,1200,3024,0,0,495,3311,-1,0.0
20215,train_Client_100551,11,1099471,5,467,6,1.5,0,0,0,0,25778,25778,-1,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4457223,train_Client_99465,11,99072847,0,506,9,1.5,200,100,200,1810,459733,461273,-1,0.0
4457224,train_Client_99465,11,99072847,0,506,6,1.5,300,3633,0,0,361738,364360,-1,0.0
4457225,train_Client_99465,11,99072847,0,506,6,1.5,300,4327,0,0,364360,367445,-1,0.0
4457226,train_Client_99465,11,99072847,0,506,9,1.5,200,100,200,1321,458519,459733,-1,0.0


The usage needs to be recalculated with the corrected coefficients.

In [28]:
# Converting 'mtr_val_' into float for now, as decimal places get in the way otherwise
df_train['mtr_val_old'] = df_train['mtr_val_old'].astype(float).round(0)
df_train['mtr_val_new'] = df_train['mtr_val_new'].astype(float).round(0)
# Getting total usage (scaled by coef):
df_train.loc[:, 'usage_n'] = (df_train[['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1) / df_train['mtr_coef']).round(0)
df_train.loc[:, 'mtr_val_new_calc'] = df_train.loc[:, 'mtr_val_old'] + df_train.loc[:, 'usage_n'].round(0) # This is what is expected if all is working
mask = df_train['mtr_val_new'] != df_train['mtr_val_new_calc']
print(sum(mask))
df_temp = df_train[mask]
df_temp

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['mtr_val_old'] = df_train['mtr_val_old'].astype(float).round(0)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['mtr_val_new'] = df_train['mtr_val_new'].astype(float).round(0)


16389


Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,mtr_type,district,client_type,region,start_date,fraud,adj_flag,usage_n,mtr_val_new_calc,temp_flag
1120,train_Client_100024,2014-06-20,11,482429,0,207,8,1.0,800,400,...,ELEC,60,11,101,1985-03-22,0.0,False,1518.0,100778.0,False
1215,train_Client_100025,2012-02-17,11,1028965,0,203,6,1.0,372,0,...,ELEC,63,11,311,2005-12-15,0.0,False,372.0,7324.0,False
1865,train_Client_100050,2012-01-25,11,112753,0,413,9,1.0,2400,543,...,ELEC,63,11,311,1994-12-31,0.0,False,2943.0,10652.0,False
2632,train_Client_100075,2008-11-27,11,502690,0,410,6,1.0,116,0,...,ELEC,60,11,101,2007-09-13,0.0,False,116.0,116.0,False
2818,train_Client_10008,2015-05-01,11,842528,0,413,9,1.0,800,400,...,ELEC,63,11,101,1987-09-14,0.0,False,2363.0,101849.0,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4475524,train_Client_99964,2010-08-11,15,403585,0,202,9,1.0,215,0,...,ELEC,60,11,101,1978-10-04,0.0,False,215.0,4550.0,False
4475991,train_Client_99978,2011-08-12,40,7006449,0,5,6,1.0,395,0,...,GAZ,69,11,104,1995-10-30,0.0,False,395.0,2876.0,False
4476318,train_Client_99985,2006-10-18,11,974695,0,532,6,3.0,1200,48618,...,ELEC,60,51,101,1994-12-31,0.0,False,16606.0,116514.0,False
4476420,train_Client_99988,2012-05-01,40,70623,0,5,6,1.0,61,0,...,GAZ,63,11,101,2002-12-10,0.0,False,61.0,1611.0,False


There is an issue with how the dates were parsed before being uploaded to Kaggle. In places, the month and day are switched, as deduced by considering 'mtr_val_old', 'mtr_val_new', and 'months_num'.

In [29]:
df_train.sort_values(by=['client_id', 'mtr_id', 'mtr_tariff', 'invoice_date'], inplace=True)
df_train['mtr_val_nxt'] = df_train.groupby(['client_id', 'mtr_id'])['mtr_val_old'].shift(-1)
df_train['temp_flag'] = (df_train['mtr_val_new'] > df_train['mtr_val_nxt']) & (df_train['mtr_val_nxt'].notna()) # Out of order
print(sum(df_train['temp_flag'])) # number of times rule is being broken
df_temp = df_train[df_train['temp_flag'] == True]
df_temp

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train.sort_values(by=['client_id', 'mtr_id', 'mtr_tariff', 'invoice_date'], inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['mtr_val_nxt'] = df_train.groupby(['client_id', 'mtr_id'])['mtr_val_old'].shift(-1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['temp_flag'] = (df_train['mtr_val_new'] > df_train['mtr_val_nxt']) 

598574


Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,district,client_type,region,start_date,fraud,adj_flag,usage_n,mtr_val_new_calc,temp_flag,mtr_val_nxt
19,train_Client_0,2013-02-12,11,1335667,0,203,8,1.0,38,0,...,60,11,101,1994-12-31,0.0,False,38.0,14302.0,True,12294.0
43,train_Client_1,2008-02-07,11,678902,0,203,6,1.0,528,0,...,69,11,107,2002-05-29,0.0,False,528.0,9132.0,True,8049.0
56,train_Client_1,2009-03-11,11,678902,0,203,6,1.0,1207,0,...,69,11,107,2002-05-29,0.0,False,1207.0,11429.0,True,9724.0
57,train_Client_1,2010-05-10,11,678902,0,203,6,1.0,520,0,...,69,11,107,2002-05-29,0.0,False,520.0,12781.0,True,11817.0
40,train_Client_1,2010-07-06,11,678902,0,203,6,1.0,444,0,...,69,11,107,2002-05-29,0.0,False,444.0,12261.0,True,11429.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4476721,train_Client_99997,2016-03-06,40,0,0,5,9,1.0,59,0,...,63,11,311,2011-11-22,0.0,False,59.0,921.0,True,796.0
4476727,train_Client_99997,2013-11-10,11,262195,0,207,9,1.0,331,0,...,63,11,311,2011-11-22,0.0,False,331.0,2221.0,True,1608.0
4476720,train_Client_99997,2016-03-06,11,262195,0,207,9,1.0,372,0,...,63,11,311,2011-11-22,0.0,False,372.0,4923.0,True,4186.0
4476724,train_Client_99997,2017-02-06,11,262195,0,207,9,1.0,369,0,...,63,11,311,2011-11-22,0.0,False,369.0,6087.0,True,5320.0


In [30]:
mask = df_train['invoice_date'].dt.day <= 12 # A swap is only possible when the day is a valid month number (the month is always <= 12)
print(sum(mask)) # Count of times the date issue could have occurred: ~44% of rows
print(sum((df_train['temp_flag'] == True) & mask))

1976616
535147


In [31]:
# Flip the day and month for these cases. There might be a vectorised way for this to speed this up...
df_train.loc[mask, 'adj_flag'] = True
df_train.loc[mask, 'invoice_date'] = df_train.loc[mask, 'invoice_date'].apply(
    lambda x: x.replace(day=x.month, month=x.day) if pd.notna(x) else x)
# Re-sort and check for the same rule again
col_names = ['client_id', 'mtr_id', 'mtr_tariff']
df_train.sort_values(by=['client_id', 'mtr_id', 'mtr_tariff', 'invoice_date'], inplace=True)
df_train['mtr_val_nxt'] = df_train.groupby(col_names)['mtr_val_old'].shift(-1)
df_train['temp_flag'] = (df_train['mtr_val_new'] > df_train['mtr_val_nxt']) & (df_train['mtr_val_nxt'].notna())
print(sum(df_train['temp_flag']))
df_temp = df_train[df_train['temp_flag'] == True]
df_temp

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train.sort_values(by=['client_id', 'mtr_id', 'mtr_tariff', 'invoice_date'], inplace=True)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['mtr_val_nxt'] = df_train.groupby(col_names)['mtr_val_old'].shift(-1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['temp_flag'] = (df_train['mtr_val_new'] > df_train['mtr_val_nxt']) & (df_train['m

12031


Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,district,client_type,region,start_date,fraud,adj_flag,usage_n,mtr_val_new_calc,temp_flag,mtr_val_nxt
287,train_Client_100001,2011-08-09,40,126093,0,5,9,1.0,206,0,...,60,11,101,2006-04-12,0.0,True,206.0,1893.0,True,1804.0
1613,train_Client_100044,2012-04-06,10,49967888,0,202,6,1.0,200,54,...,62,11,301,1984-10-31,0.0,True,254.0,9977.0,True,219.0
2430,train_Client_100071,2012-10-22,11,455964,0,203,9,1.0,1146,0,...,62,11,301,1985-05-16,0.0,False,1146.0,52790.0,True,52290.0
2534,train_Client_100073,2010-05-05,40,6976331,0,5,9,1.0,791,0,...,62,11,305,1983-12-21,0.0,True,791.0,853.0,True,88.0
3126,train_Client_100087,2008-10-28,40,6736574,0,5,6,1.0,807,0,...,69,11,107,2008-03-06,0.0,False,807.0,807.0,True,753.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4474037,train_Client_99927,2015-04-07,40,48255,0,5,9,1.0,412,0,...,62,11,301,2001-07-20,0.0,True,412.0,2490.0,True,2308.0
4474286,train_Client_99931,2011-10-12,11,10209,0,413,9,1.0,1200,306,...,63,11,311,1996-11-20,0.0,True,1506.0,23704.0,True,23104.0
4474545,train_Client_99938,2013-06-18,11,413121,1,203,9,1.0,359,0,...,63,11,311,2012-12-29,0.0,False,359.0,359.0,True,100.0
4475292,train_Client_99958,2007-03-13,40,54976,0,5,6,1.0,0,0,...,60,11,101,2001-12-24,0.0,False,0.0,1094.0,True,1073.0


Of the 1976616 rows that were flipped, the number of flagged rows dropped from 535147 to 12031. This is far better than would be expected if the day and month had been swapped at random. However, the flipped rows were not checked manually, so some subsets may have been flipped that should not have been.
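The flip cell wonders whether a vectorised version exists. One possible sketch, on toy data, rebuilds the dates from their components instead of calling `apply` row by row. The helper name `swap_day_month` is hypothetical, and the behaviour is assumed to match the `replace(day=x.month, month=x.day)` call above (missing dates are left untouched).

```python
import pandas as pd

def swap_day_month(dates: pd.Series) -> pd.Series:
    """Swap day and month wherever the day is a valid month number (<= 12)."""
    swappable = dates.dt.day <= 12  # NaT compares as False, so it is skipped
    parts = pd.DataFrame({
        'year': dates.dt.year[swappable].astype('int64'),
        'month': dates.dt.day[swappable].astype('int64'),   # old day -> new month
        'day': dates.dt.month[swappable].astype('int64'),   # old month -> new day
    })
    out = dates.copy()
    # pd.to_datetime assembles dates from year/month/day columns
    out.loc[swappable] = pd.to_datetime(parts)
    return out

toy = pd.Series(pd.to_datetime(['2013-02-12', '2012-10-22', None]))
swapped = swap_day_month(toy)
print(swapped)
```

Only the first toy date changes: 12 February becomes 2 December, while the date with day 22 and the missing date are kept as they are.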

There is also a "rollover" issue with the 'mtr_val_' columns: some meters reset to 0 after exceeding a cap such as 100000 or 1000000. The fix is to add the cap back to the affected readings. As the cap is not known in advance, it has to be inferred from the rows where the rollover occurred. There are also cases where 'mtr_val_' does not update, or where the arithmetic does not match; these are ignored for now.
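To make the inference concrete, here is a minimal sketch that treats a row as a rollover only when the gap between the expected and the recorded reading is an exact power of ten. The function name `infer_rollover_cap` and the toy numbers are illustrative assumptions, not part of the notebook.

```python
import math

def infer_rollover_cap(expected_new: float, recorded_new: float):
    """Return the inferred cap if the gap is an exact power of ten, else None."""
    gap = expected_new - recorded_new
    if gap <= 0:
        return None
    exponent = round(math.log10(gap))
    # Accept only exact powers of ten (10000, 100000, ...)
    return 10 ** exponent if 10 ** exponent == gap else None

cap_a = infer_rollover_cap(100778, 778)    # gap of exactly 100000
cap_b = infer_rollover_cap(13147, 3147)    # gap of exactly 10000
cap_c = infer_rollover_cap(119770, 19772)  # gap of 99998: not a clean rollover
print(cap_a, cap_b, cap_c)
```

Rows such as the third one, where the gap is close to a power of ten but not equal to it, are left for a separate check rather than corrected automatically.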

In [32]:
mask = (((df_train['mtr_val_old'] != df_train['mtr_val_new']) & (df_train['usage_n'] > 0)) & # Ignoring case where meter seems to not be updating
        (df_train['mtr_val_new_calc'].astype(str).str.len() > df_train['mtr_val_new'].astype(str).str.len())) # Looking for number of digits increasing 
print(sum(mask))
df_train['temp_flag'] = mask
df_temp = df_train[mask]
df_temp

2310


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['temp_flag'] = mask


Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,district,client_type,region,start_date,fraud,adj_flag,usage_n,mtr_val_new_calc,temp_flag,mtr_val_nxt
1120,train_Client_100024,2014-06-20,11,482429,0,207,8,1.0,800,400,...,60,11,101,1985-03-22,0.0,False,1518.0,100778.0,True,778.0
2818,train_Client_10008,2015-01-05,11,842528,0,413,9,1.0,800,400,...,63,11,101,1987-09-14,0.0,True,2363.0,101849.0,True,1849.0
5257,train_Client_100144,2012-04-06,11,543494,0,207,6,1.0,600,99398,...,69,11,104,2001-02-28,0.0,True,99998.0,119770.0,True,19772.0
5751,train_Client_100157,2018-02-12,11,35572,0,420,9,1.0,800,400,...,60,11,101,2011-07-11,0.0,True,6086.0,103185.0,True,3185.0
9097,train_Client_100257,2008-01-25,10,7292038,0,202,9,1.0,200,3150,...,60,11,101,2007-10-01,0.0,False,3350.0,13147.0,True,3147.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4458606,train_Client_99506,2019-04-15,11,14604,0,433,9,1.0,1600,800,...,69,11,104,1998-09-08,0.0,False,5823.0,103055.0,True,
4461423,train_Client_99580,2011-03-25,11,10907,0,410,6,1.0,1200,1518,...,62,12,309,1990-06-19,0.0,False,2718.0,102054.0,True,2054.0
4461805,train_Client_9959,2010-06-18,11,1167412,0,207,6,1.0,1200,1491,...,60,11,101,1981-05-14,0.0,False,2691.0,101073.0,True,1073.0
4466699,train_Client_99733,2012-04-03,11,800241,0,413,8,1.0,4800,3079,...,69,11,104,1986-01-20,0.0,True,7879.0,100185.0,True,593.0


In [33]:
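# The gap is a float, so its string form ends in ".0"; for a power of ten, length - 3 is the exponent
gap = df_train['mtr_val_new_calc'] - df_train['mtr_val_new']
cap = gap.astype(str).str.len().apply(lambda x: 10**(x-3))
mask = (df_train['temp_flag'] == True) & ((cap + df_train['mtr_val_new']) != df_train['mtr_val_new_calc'])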
df_train['err_flag'] = mask
print(sum(mask))
df_temp = df_train[mask]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train['err_flag'] = mask


43


In [None]:
df_temp = df_train[((df_train['mtr_val_old'] == df_train['mtr_val_new']) & (df_train['usage_n'] > 0))].copy()
df_temp['temp_flag'] = False
df_temp.loc[(df_temp['mtr_val_nxt'] == df_temp['mtr_val_new_calc']) | (df_temp['mtr_val_nxt'].isna()), 'temp_flag'] = True

# If you can fix counter, do it


Once a rollover occurs, it has knock-on effects, and multiple rollovers can occur on the same meter.
On the row where a rollover occurs, only 'mtr_val_new' needs adjusting; every row after it needs both 'mtr_val_new' and 'mtr_val_old' adjusting.
Each rollover row is flagged, and the rollover amount is subtracted from its 'mtr_val_old' (it is added back in a later step).
The flagged amounts are then accumulated per meter, and the running total is added to both 'mtr_val_old' and 'mtr_val_new'.
Finally, the expected 'mtr_val_new_calc' is recalculated.
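The steps above can be sketched on a toy meter with two rollovers. This is a simplified illustration, assuming a cap of 1000 and that every mismatch in the toy data is a genuine rollover; the `_adj` column names are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    'mtr_id':      [1, 1, 1, 1, 1],
    'mtr_val_old': [800, 950, 100, 300, 50],
    'mtr_val_new': [950, 100, 300, 50, 200],
    'usage_n':     [150, 150, 200, 750, 150],
})

expected_new = toy['mtr_val_old'] + toy['usage_n']
# Amount lost on each rollover row (0 on rows where the arithmetic matches)
jump = expected_new - toy['mtr_val_new']
# Running total of rollover amounts per meter, including the current row
offset = jump.groupby(toy['mtr_id']).cumsum()

toy['mtr_val_new_adj'] = toy['mtr_val_new'] + offset
# The old reading on a rollover row predates that rollover, so its own jump is excluded
toy['mtr_val_old_adj'] = toy['mtr_val_old'] + offset - jump
print(toy)
```

After the adjustment the readings increase monotonically, and old reading plus usage equals the new reading on every row.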

In [33]:
mask_2 = df_train[mask].apply(lambda row: str(row['mtr_val_new_calc']).endswith(str(row['mtr_val_new'])),axis=1) # Check that the end digits match
df_train.loc[mask & mask_2, 'mtr_roll_flag'] = df_train.loc[mask & mask_2, 'mtr_val_new_calc'] - df_train.loc[mask & mask_2, 'mtr_val_new'] # First instances
df_train.loc[df_train['mtr_roll_flag'].notna(), 'mtr_val_old'] = df_train.loc[df_train['mtr_roll_flag'].notna(), 'mtr_val_old'] - df_train.loc[df_train['mtr_roll_flag'].notna(), 'mtr_roll_flag'] # This is to undo adding it shortly
df_train.loc[df_train['mtr_roll_flag'].notna(), 'mtr_roll_flag'] = df_train[df_train['mtr_roll_flag'].notna()].groupby(col_names)['mtr_roll_flag'].cumsum() # Count number of instances
df_train['mtr_roll_flag'] = df_train.groupby(col_names)['mtr_roll_flag'].ffill()
df_train['mtr_roll_flag'] = df_train['mtr_roll_flag'].fillna(0)
df_train['mtr_val_old'] = df_train['mtr_val_old'] + df_train['mtr_roll_flag'] # This will now not affect the first instance as it was already subtracted by the same amount.
df_train['mtr_val_new'] = df_train['mtr_val_new'] + df_train['mtr_roll_flag']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_train.loc[mask & mask_2, 'mtr_roll_flag'] = df_train.loc[mask & mask_2, 'mtr_val_new_calc'] - df_train.loc[mask & mask_2, 'mtr_val_new'] # First instances
  df_train['mtr_roll_flag'] = df_train.groupby(col_names)['mtr_roll_flag'].fillna(method='ffill')
  df_train['mtr_roll_flag'] = df_train.groupby(col_names)['mtr_roll_flag'].fillna(method='ffill')



In [31]:
# Recalculate with updated values
df_train['mtr_val_new_calc'] = (df_train['mtr_val_old'] + df_train['usage_n']).astype(int)

Remaining rows with issues:

In [43]:
df_train['mtr_stuck_flag'] = (df_train['mtr_val_new_calc'] != df_train['mtr_val_new']) & (df_train['mtr_val_old'] == df_train['mtr_val_new'])
df_train['usage_stuck_flag'] = (df_train['mtr_val_new_calc'] != df_train['mtr_val_new']) & (df_train['usage_n'] == 0)

In [37]:
mask = df_train['mtr_val_new_calc'] != df_train['mtr_val_new']
df_temp = df_train[mask]
df_temp

Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,client_type,region,start_date,fraud,mtr_val_nxt,mtr_val_flag,date_flag,usage_n,mtr_val_new_calc,mtr_roll_flag
1215,train_Client_100025,2012-02-17,11,1028965,0,203,6,1.0,372,0,...,11,311,2005-12-15,0.0,7300.0,False,False,372.0,7324,0.0
1865,train_Client_100050,2012-01-25,11,112753,0,413,9,1.0,2400,543,...,11,311,1994-12-31,0.0,7709.0,False,False,2943.0,10652,0.0
2632,train_Client_100075,2008-11-27,11,502690,0,410,6,1.0,116,0,...,11,101,2007-09-13,0.0,116.0,False,False,116.0,116,0.0
2865,train_Client_100080,2009-06-11,14,630280,0,413,6,1.0,5000,0,...,11,371,1984-12-13,0.0,3143.0,False,True,5000.0,8143,0.0
3004,train_Client_100083,2012-02-16,40,6788434,0,5,8,1.0,204,0,...,11,107,1982-10-26,1.0,1469.0,False,False,204.0,1673,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4475524,train_Client_99964,2010-11-08,15,403585,0,202,9,1.0,215,0,...,11,101,1978-10-04,0.0,4335.0,False,True,215.0,4550,0.0
4475523,train_Client_99964,2010-11-08,40,6856745,0,5,6,1.0,73,0,...,11,101,1978-10-04,0.0,238.0,False,True,73.0,311,0.0
4475991,train_Client_99978,2011-12-08,40,7006449,0,5,6,1.0,395,0,...,11,104,1995-10-30,0.0,4118.0,False,True,395.0,2876,0.0
4476420,train_Client_99988,2012-01-05,40,70623,0,5,6,1.0,61,0,...,11,101,2002-12-10,0.0,1649.0,False,True,61.0,1611,0.0


About half of these rows have a meter that is not updating ('mtr_val_old' equals 'mtr_val_new'); filtering them out leaves:

In [38]:
mask = df_temp['mtr_val_old'] != df_temp['mtr_val_new']
df_temp = df_temp[mask]
df_temp

Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,client_type,region,start_date,fraud,mtr_val_nxt,mtr_val_flag,date_flag,usage_n,mtr_val_new_calc,mtr_roll_flag
1215,train_Client_100025,2012-02-17,11,1028965,0,203,6,1.0,372,0,...,11,311,2005-12-15,0.0,7300.0,False,False,372.0,7324,0.0
5579,train_Client_100151,2012-02-06,11,1211160,0,203,8,1.0,616,0,...,11,107,1993-09-29,0.0,24347.0,False,True,616.0,24469,0.0
7571,train_Client_100216,2012-02-27,11,1170937,0,203,8,1.0,336,0,...,11,311,1992-11-24,1.0,4019.0,False,False,336.0,4111,0.0
8560,train_Client_100243,2012-02-07,40,6983721,0,5,9,1.0,28,0,...,11,311,1993-02-10,0.0,327.0,False,True,28.0,284,0.0
8624,train_Client_100244,2014-05-15,11,570807,5,413,6,1.0,0,0,...,12,103,1991-06-25,0.0,,False,False,0.0,7110,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4473770,train_Client_99919,2018-08-14,11,378,0,410,9,1.0,379,0,...,11,301,1983-01-10,0.0,8988.0,False,False,379.0,8431,0.0
4473789,train_Client_99919,2018-12-21,11,378,0,410,9,1.0,489,0,...,11,301,1983-01-10,0.0,9890.0,False,False,489.0,9477,0.0
4473779,train_Client_99919,2019-04-22,11,378,0,410,9,1.0,348,0,...,11,301,1983-01-10,0.0,10653.0,False,False,348.0,10238,0.0
4473737,train_Client_99919,2019-08-26,11,378,0,410,9,1.0,671,0,...,11,301,1983-01-10,0.0,,False,False,671.0,11324,0.0


In [40]:
mask = df_temp['usage_n'] > 0
df_temp[mask]

Unnamed: 0,client_id,invoice_date,mtr_tariff,mtr_id,mtr_status,mtr_code,mtr_notes,mtr_coef,usage_1,usage_2,...,client_type,region,start_date,fraud,mtr_val_nxt,mtr_val_flag,date_flag,usage_n,mtr_val_new_calc,mtr_roll_flag
1215,train_Client_100025,2012-02-17,11,1028965,0,203,6,1.0,372,0,...,11,311,2005-12-15,0.0,7300.0,False,False,372.0,7324,0.0
5579,train_Client_100151,2012-02-06,11,1211160,0,203,8,1.0,616,0,...,11,107,1993-09-29,0.0,24347.0,False,True,616.0,24469,0.0
7571,train_Client_100216,2012-02-27,11,1170937,0,203,8,1.0,336,0,...,11,311,1992-11-24,1.0,4019.0,False,False,336.0,4111,0.0
8560,train_Client_100243,2012-02-07,40,6983721,0,5,9,1.0,28,0,...,11,311,1993-02-10,0.0,327.0,False,True,28.0,284,0.0
8625,train_Client_100244,2014-05-15,11,630735,0,413,8,1.0,400,200,...,12,103,1991-06-25,0.0,119.0,False,False,646.0,646,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4473742,train_Client_99919,2017-08-16,11,378,0,410,9,1.0,404,0,...,11,301,1983-01-10,0.0,8052.0,False,False,404.0,5567,0.0
4473770,train_Client_99919,2018-08-14,11,378,0,410,9,1.0,379,0,...,11,301,1983-01-10,0.0,8988.0,False,False,379.0,8431,0.0
4473789,train_Client_99919,2018-12-21,11,378,0,410,9,1.0,489,0,...,11,301,1983-01-10,0.0,9890.0,False,False,489.0,9477,0.0
4473779,train_Client_99919,2019-04-22,11,378,0,410,9,1.0,348,0,...,11,301,1983-01-10,0.0,10653.0,False,False,348.0,10238,0.0


Checking if these changes worked: ((Sum of usage) / mtr_coef) == diff(mtr_val_new, mtr_val_old)

In [None]:
mask = (df_train['mtr_val_new'] - ((df_train[['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1) / df_train['mtr_coef']) + df_train['mtr_val_old'])) != 0
df_train[mask][col_names]

In [None]:
df_temp = df_train[['start_date', 'invoice_date', 'mtr_tariff', 'mtr_id', 'mtr_code']].copy()
df_temp['id_len'] = df_temp['mtr_id'].astype(str).str.len()
df_temp['code_len'] = df_temp['mtr_code'].astype(str).str.len()
df_temp['sum'] = df_temp['mtr_id'] + df_temp['mtr_code']

In [None]:
df_temp.iloc[:,3]

In [None]:
df_train['mtr_code'].value_counts()

In [None]:
df_train[mask][['mtr_coef', 'usage_1']].value_counts().sort_index()

For these, it seems as though 'mtr_coef' was accidentally split at its decimal point.

In [None]:
df_train[df_train['client_id'] == 'train_Client_53725']

Rule 1 (L1-L4 are usage_1-usage_4): IF L4: L3 = L1 & L2 = L1/2

In [None]:
mask_L4 = df_train['usage_4'] > 0
print((mask_L4.sum()))
mask_L4_L3 = df_train['usage_3'][mask_L4] != df_train['usage_1'][mask_L4]
df_train[['client_id', 'mtr_id', 'mtr_code', 'mtr_tariff', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'fraud']][mask_L4][mask_L4_L3]

For these, it seems as though 

In [None]:
mask_L4_L2 = df_train['usage_2'][mask_L4] * 2 != df_train['usage_1'][mask_L4]
df_train[['client_id', 'mtr_id', 'mtr_code', 'mtr_tariff', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'fraud']][mask_L4][mask_L4_L2]

This rule is broken 102 / 92958 times.

Rule 2: IF L3: L2 = L1/2

In [None]:
mask_L3 = df_train['usage_3'] > 0
print((mask_L3.sum()))
mask_L3_L2 = df_train['usage_2'][mask_L3] * 2 != df_train['usage_1'][mask_L3]
df_train[['client_id', 'mtr_id', 'mtr_code', 'mtr_tariff', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'fraud']][mask_L3][mask_L3_L2]

In [None]:
df_train['usage_1'][mask_L3][mask_L3_L2].unique()

This rule is broken 422 / 183358 times.

Rule 3: IF L2: L1 = L1_Max (L1 % 50 == 0)

In [None]:
mask_L2 = df_train['usage_2'] > 0
print((mask_L2.sum()))
mask_L2_L1 = df_train['usage_1'][mask_L2] % 50 != 0
df_train[['client_id', 'mtr_id', 'mtr_code', 'mtr_tariff', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'fraud']][mask_L2][mask_L2_L1]

In [None]:
df_train['usage_1'][mask_L2][mask_L2_L1].unique()

This rule is broken 1305 / 660570 times.
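Rules 1 to 3 share one pattern: select the rows where a higher usage level is in use, then count the rows where the expected relationship fails. A minimal sketch of that pattern on toy data follows. The helper `count_violations` is hypothetical, and Rule 1 is checked here as a single combined condition, whereas the cells above check its two halves separately.

```python
import pandas as pd

def count_violations(df, applies, holds):
    """Return (number of violations, number of rows where the rule applies)."""
    applies_mask = applies(df)
    broken = applies_mask & ~holds(df)
    return int(broken.sum()), int(applies_mask.sum())

toy = pd.DataFrame({
    'usage_1': [800, 800, 620, 37],
    'usage_2': [400, 300, 250, 0],
    'usage_3': [800, 100, 0, 0],
    'usage_4': [50, 0, 0, 0],
})

rules = {
    'Rule 1': (lambda d: d['usage_4'] > 0,
               lambda d: (d['usage_3'] == d['usage_1']) & (d['usage_2'] * 2 == d['usage_1'])),
    'Rule 2': (lambda d: d['usage_3'] > 0,
               lambda d: d['usage_2'] * 2 == d['usage_1']),
    'Rule 3': (lambda d: d['usage_2'] > 0,
               lambda d: d['usage_1'] % 50 == 0),
}
results = {name: count_violations(toy, applies, holds)
           for name, (applies, holds) in rules.items()}
print(results)
```

Keeping the "applies" and "holds" conditions separate makes it easy to report each rule in the same "broken / total" form used in the notes.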

Rule 4: IF L2 & L1 < 10: Cannot rely on L1 relationships.

These rules give circumstantial evidence of which usage levels (Ln) are in use and whether a level is capped. If the expected cap has not yet been exceeded, the rules cannot tell.
There is some further information in tariff_type (T_T below).

Rule 5: IF T_T == 10 | 11: L3 & L4 Possible
Rule 6: IF T_T == 10 | 11 | 40 | 45: L2 Possible
Else: Only L1 Possible
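Rules 5 and 6 can be read as a lookup from tariff to the highest usage level that should ever be non-zero. The sketch below flags toy rows that use a level their tariff should not allow. The helper `max_level_allowed` and the column `breaks_tariff_rule` are hypothetical names, and the tariff lists are taken directly from the two rules above.

```python
import pandas as pd

def max_level_allowed(tariff: pd.Series) -> pd.Series:
    """Highest usage level permitted per row under Rules 5 and 6."""
    allowed = pd.Series(1, index=tariff.index)          # Else: only L1
    allowed.loc[tariff.isin([10, 11, 40, 45])] = 2      # Rule 6: L2 possible
    allowed.loc[tariff.isin([10, 11])] = 4              # Rule 5: L3 and L4 possible
    return allowed

toy = pd.DataFrame({
    'mtr_tariff': [11, 40, 15],
    'usage_1': [800, 200, 215],
    'usage_2': [400, 54, 10],
    'usage_3': [800, 5, 0],
    'usage_4': [50, 0, 0],
})

used = toy[['usage_1', 'usage_2', 'usage_3', 'usage_4']].gt(0).astype(int)
# Highest level with non-zero usage (1-4); 0 when the row has no usage at all
highest_used = used.mul([1, 2, 3, 4], axis='columns').max(axis=1)
toy['breaks_tariff_rule'] = highest_used > max_level_allowed(toy['mtr_tariff'])
print(toy)
```

In the toy data the tariff 40 row uses L3 and the tariff 15 row uses L2, so both are flagged, while the tariff 11 row is allowed all four levels.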

In [None]:
print((mask_L2.sum()))
df_train[['client_id', 'mtr_id', 'mtr_code', 'mtr_tariff', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'fraud']][mask_L2 & ~(df_train['mtr_tariff'].isin([10, 11, 40, 45]))]

In [None]:
print((mask_L3.sum()))
df_train[['client_id', 'mtr_id', 'mtr_code', 'mtr_tariff', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'fraud']][mask_L3 & ~(df_train['mtr_tariff'].isin([10, 11]))]

In [None]:
(df_train['usage_3'][mask_L4] < 100).sum()

There remains: "mtr_code", "mtr_status", "mtr_notes" and "mtr_coef" to understand.

mtr_code does not seem to add anything on this topic beyond tariff_type.
mtr_coef is said to be applied when standard consumption is exceeded; however, it is very rarely used.
It is also difficult to determine the logic behind its application.

In [None]:
contingency_table = pd.crosstab(df_train['mtr_code'], df_train['mtr_tariff'])

sns.heatmap(contingency_table, annot=True, fmt="d", cmap="Blues", cbar=True)
plt.title("Relationship Between mtr_code and mtr_tariff")
plt.xlabel("mtr_tariff")
plt.ylabel("mtr_code")
plt.show()

In [None]:
sns.scatterplot(x=df_train['mtr_code'], y=df_train['mtr_tariff'])

There is clearly a link between tariff_type and mtr_code, but I have not found a rule that captures it programmatically.
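One way to put a number on this link, without claiming to explain it, is to ask how much of each mtr_code falls under its most common tariff. The sketch below does this on toy data with a row-normalised crosstab. The `summary` frame and its column names are illustrative assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    'mtr_code':   [203, 203, 203, 5, 5, 413, 413],
    'mtr_tariff': [11, 11, 11, 40, 40, 11, 10],
})

# Share of each tariff within every mtr_code (each row sums to 1)
shares = pd.crosstab(toy['mtr_code'], toy['mtr_tariff'], normalize='index')
summary = pd.DataFrame({
    'dominant_tariff': shares.idxmax(axis=1),
    'dominant_share': shares.max(axis=1),
})
print(summary)
```

A dominant share close to 1 for most codes would suggest that mtr_code is largely determined by the tariff; codes with a low share are the ones worth inspecting by hand.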

In [None]:
df_train[df_train['mtr_tariff'] == 45].boxplot(column='usage_1', by='mtr_code', grid=False, showmeans=True, vert = False)
plt.ylim(4, 9)
plt.xlim(0, 50000)

In [None]:
df_train[df_train['mtr_tariff'] == 45].boxplot(column='usage_2', by='mtr_code', grid=False, showmeans=True, vert = False)
plt.ylim(4, 9)

In [None]:
contingency_table

In [None]:
print(df_train['mtr_tariff'].nunique())
print(df_train['mtr_code'].nunique())

In [None]:
df_train[['mtr_tariff', 'mtr_code']].value_counts()

In [None]:
df_temp = df_train.groupby(['mtr_coef','fraud']).agg({'client_id': 'nunique'})

df_temp = df_temp.reset_index().pivot(index='mtr_coef', columns='fraud', values='client_id').fillna(0)

df_temp.index = df_temp.index.astype(str)

# Plot the data
plt.figure(figsize=(10, 6))
plt.bar(df_temp.index, df_temp[1], label='Fraud = 1', color = "red")
plt.bar(df_temp.index, df_temp[0], bottom=df_temp[1], label='Fraud = 0', color = "lightgreen")
plt.legend()
plt.ylim(0, 25)

In [None]:
df_temp = df_train.groupby(['mtr_coef','fraud']).agg({'client_id': 'nunique'})

df_temp = df_temp.reset_index().pivot(index='mtr_coef', columns='fraud', values='client_id').fillna(0)

df_temp.index = df_temp.index.astype(str)

# Plot the data
plt.figure(figsize=(10, 6))
plt.bar(df_temp.index, 100*df_temp[1]/(df_temp[0] + df_temp[1]), label='Fraud = 1', color = "red")
plt.bar(df_temp.index, 100*df_temp[0]/(df_temp[0] + df_temp[1]), bottom=100*df_temp[1]/(df_temp[0] + df_temp[1]), label='Fraud = 0', color = "lightgreen")
plt.legend()
plt.ylim(0, 100)

In [None]:
df_temp.sum()

It is hard to know what to do with this. It would pretty much amount to assuming mtr_coef > 1 == No Fraud, with only 2 / 1646 breaking this rule.
135,459 / 143,028 are mtr_coef = 1; there is not enough data on the other values to conclude anything, even if proportionally they seem significant.
(Going to condense to 0 | 1 | Other. Do this later.)

In [None]:
df_temp = df_train.groupby(['mtr_notes','fraud']).agg({'client_id': 'nunique'})

df_temp = df_temp.reset_index().pivot(index='mtr_notes', columns='fraud', values='client_id').fillna(0)

df_temp.index = df_temp.index.astype(str)

# Plot the data
plt.figure(figsize=(10, 6))
plt.bar(df_temp.index, df_temp[1], label='Fraud = 1', color = "red")
plt.bar(df_temp.index, df_temp[0], bottom=df_temp[1], label='Fraud = 0', color = "lightgreen")

plt.legend()
plt.ylim(0, 50)

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(df_temp.index, 100*df_temp[1]/(df_temp[0] + df_temp[1]), label='Fraud = 1', color = "red")
plt.bar(df_temp.index, 100*df_temp[0]/(df_temp[0] + df_temp[1]), bottom=100*df_temp[1]/(df_temp[0] + df_temp[1]), label='Fraud = 0', color = "lightgreen")

plt.legend()
plt.ylim(0, 100)

Looking at mtr_notes: there are not enough rows with 5 | 203 | 207 | 413 to conclude anything. However, 6 | 7 | 8 | 9 do seem relevant, and of these, 7 looks quite different from the other three.
(Going to condense to 6 | 7 | 8 | 9 | Other. Do this later.)
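The condensing planned for mtr_notes (and for mtr_coef earlier) could look roughly like this. It is a sketch on toy data: the helper `condense` and the `_grp` column names are hypothetical. In the real data mtr_coef is stored as a float, so its kept labels would read '0.0' and '1.0' unless it is cast to int first.

```python
import pandas as pd

def condense(values: pd.Series, keep) -> pd.Series:
    """Keep the listed categories as text labels and map everything else to 'Other'."""
    as_text = values.astype(str)
    return as_text.where(values.isin(keep), 'Other')

toy = pd.DataFrame({
    'mtr_notes': [6, 7, 8, 9, 203, 5],
    'mtr_coef':  [1, 0, 1, 10, 1, 3],
})
toy['mtr_notes_grp'] = condense(toy['mtr_notes'], [6, 7, 8, 9])
toy['mtr_coef_grp'] = condense(toy['mtr_coef'], [0, 1])
print(toy)
```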

In [None]:
df_temp = df_train[df_train['mtr_notes'].isin([6, 7, 8, 9])]
df_temp = df_temp[df_temp['mtr_coef'].isin([0, 1, 5, 10])].copy()  # copy so the assignments below do not hit a view
df_temp['mtr_notes'] = df_temp['mtr_notes'].astype(int)
df_temp['mtr_coef'] = df_temp['mtr_coef'].astype(int)

In [None]:
df_temp.groupby(['mtr_coef', 'mtr_notes','fraud']).agg({'client_id': 'nunique'})

Quick check for an interaction between mtr_coef and mtr_notes: there does not seem to be one.

In [None]:
df_temp = df_train.groupby(['mtr_status','fraud']).agg({'client_id': 'nunique'})

df_temp = df_temp.reset_index().pivot(index='mtr_status', columns='fraud', values='client_id').fillna(0)

df_temp.index = df_temp.index.astype(str)

# Plot the data
plt.figure(figsize=(10, 6))
plt.bar(df_temp.index, df_temp[1], label='Fraud = 1', color = "red")
plt.bar(df_temp.index, df_temp[0], bottom=df_temp[1], label='Fraud = 0', color = "lightgreen")

plt.legend()
plt.ylim(0, 5000)

In [None]:
plt.figure(figsize=(10, 6))
plt.bar(df_temp.index, 100*df_temp[1]/(df_temp[0] + df_temp[1]), label='Fraud = 1', color = "red")
plt.bar(df_temp.index, 100*df_temp[0]/(df_temp[0] + df_temp[1]), bottom=100*df_temp[1]/(df_temp[0] + df_temp[1]), label='Fraud = 0', color = "lightgreen")

plt.legend()
plt.ylim(0, 100)

In [None]:
df_temp = df_train[df_train['mtr_notes'].isin([6, 7, 8, 9])].copy()  # copy so the assignments below do not hit a view
df_temp['mtr_status'] = df_temp['mtr_status'].astype(str)
df_temp = df_temp[df_temp['mtr_status'].isin(['0', '1', '2', '3', '4', '5'])]
df_temp['mtr_notes'] = df_temp['mtr_notes'].astype(int)

df_temp = df_temp[['client_id', 'mtr_id', 'fraud','mtr_notes', 'mtr_status']]

Getting summed energy usage per row shows quite clearly the intended interpretation of mtr_val_old and mtr_val_new. 

In [None]:
df_train['usage_n'] = df_train[['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1) / df_train['mtr_coef'].astype(int)

In [None]:
df_train[((df_train['mtr_val_new'] - df_train['mtr_val_old']) - df_train['usage_n']) != 0]

In the cases where it is not true, what do we see?

In [None]:
df_temp = df_train[((df_train['mtr_val_new'] - df_train['mtr_val_old']) - df_train['usage_n']) != 0]
len(df_temp)

One edge case is max digit roll-over: 

In [None]:
df_temp[(df_temp['mtr_val_old'] > 900000) & (df_temp['mtr_val_new'] < 50000)]

In [None]:
mask = ((df_temp['mtr_val_old'] > 900000) & (df_temp['mtr_val_new'] < 50000))
((df_temp['mtr_val_old'][mask] + df_temp['usage_n'][mask]) - (df_temp['mtr_val_new'][mask] + 1000000))

There seems to be data corruption here. I do not know for certain, but it looks like a data entry issue when mtr_coef had a decimal. For example, 1.3 became 1 and 3, with the 3 shifting the rest of the row one column to the right.
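To illustrate the hypothesis, the sketch below takes one made-up shifted row and moves every field back one column, treating the stray digit in usage_1 as the first decimal of mtr_coef. The helper `unshift_row` and the toy values are assumptions; the column order follows the one used in the check cells that come next.

```python
import pandas as pd

def unshift_row(row: pd.Series) -> pd.Series:
    """Rebuild a row, assuming mtr_coef 'a.b' was split into mtr_coef = a and
    usage_1 = b, which pushed every later field one column to the right."""
    return pd.Series({
        'mtr_coef': row['mtr_coef'] + row['usage_1'] / 10,
        'usage_1': row['usage_2'],
        'usage_2': row['usage_3'],
        'usage_3': row['usage_4'],
        'usage_4': row['mtr_val_old'],
        'mtr_val_old': row['mtr_val_new'],
        'mtr_val_new': row['months_num'],
    })

# Made-up row: a true coefficient of 1.3 recorded as mtr_coef = 1, usage_1 = 3
recorded = pd.Series({
    'mtr_coef': 1, 'usage_1': 3, 'usage_2': 260, 'usage_3': 0, 'usage_4': 0,
    'mtr_val_old': 0, 'mtr_val_new': 1000, 'months_num': 1200,
})
fixed = unshift_row(recorded)
usage_total = fixed[['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum()
residual = fixed['mtr_val_new'] - (usage_total / fixed['mtr_coef'] + fixed['mtr_val_old'])
print(fixed)
print(residual)
```

If the hypothesis holds for a row, the residual is close to zero after unshifting; the original months_num value cannot be recovered this way, since it was pushed out of the visible columns.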

In [None]:
col_names = ['mtr_coef', 'usage_1', 'usage_2', 'usage_3', 'usage_4', 'mtr_val_old', 'mtr_val_new', 'months_num', 'usage_n']
mask = (df_temp['usage_1'] < 10) & (df_temp['usage_2'] > 0) 
df_temp[col_names][mask]

In [None]:
mask_2 = abs(df_temp['months_num'][mask] - (((df_temp[['usage_2', 'usage_3', 'usage_4', 'mtr_val_old']][mask].sum(axis=1)) / (df_temp['mtr_coef'][mask].astype(int) + (df_temp['usage_1'][mask] / 10))) + df_temp['mtr_val_new'][mask])) > 1

In [None]:
df_temp[['mtr_tariff', 'mtr_id', 'mtr_status', 'mtr_code', 'mtr_notes']+col_names][mask][mask_2]

Look at the roll over again for this subset

In [None]:
mask_3 = ((df_temp['mtr_val_new'][mask][mask_2] > 60000) & (df_temp['months_num'][mask][mask_2] < 50000))
#((df_temp['mtr_val_old'][mask] + df_temp['usage_n'][mask]) - (df_temp['mtr_val_new'][mask] + 1000000))
df_temp[['mtr_tariff', 'mtr_id', 'mtr_status', 'mtr_code', 'mtr_notes']+col_names][mask][mask_2][mask_3]

In [None]:
df_train['mtr_coef'].value_counts()

In [None]:
df_train['mtr_notes'].value_counts()

Going to assume only 6-9 are intended mtr_notes values. The rest (203, 413, 207 and 5) perhaps leaked in from mtr_code, where these values are far more common. mtr_code is 0 in 33 rows, while these stray mtr_notes values total 34.

In [None]:
df_train['mtr_code'].value_counts()

In [None]:
df_train[df_train['mtr_code'] == 0][['client_id', 'mtr_tariff', 'mtr_id', 'mtr_status', 'mtr_code', 'mtr_notes']+col_names]

In [None]:
df_pivot = df_temp.groupby(['mtr_notes','mtr_status','fraud']).agg({'client_id': 'nunique'})

df_pivot = df_pivot.reset_index()
df_pivot = df_pivot.pivot(index=['mtr_notes', 'mtr_status'], columns='fraud', values='client_id').fillna(0)

df_pivot

# Rename columns for clarity (Fraud = 0 and Fraud = 1)
df_pivot.columns = ['Not_Fraud', 'Fraud']

# Calculate proportions
df_pivot['Fraud_Proportion'] = df_pivot['Fraud'] / (df_pivot['Fraud'] + df_pivot['Not_Fraud']) * 100
df_pivot['Not_Fraud_Proportion'] = df_pivot['Not_Fraud'] / (df_pivot['Fraud'] + df_pivot['Not_Fraud']) * 100

# Reset index to make it easier to read or plot
df_pivot.reset_index(inplace=True)

df_pivot


There are differences in the proportions when considering notes and status.

This is what a single client's record set looks like:

In [None]:
df_temp = df_train[df_train['client_id'] == "train_Client_1"]
df_temp = df_temp.sort_values(by='invoice_date') 

In [None]:
df_temp

Trying to understand months_num

In [None]:
pd.concat([df_temp['invoice_date'].diff(), df_temp['months_num']*30, df_temp['invoice_date'].diff(periods=-1)], axis=1)

In [None]:
df_train['months_num'].value_counts().head(60)

There is seemingly a link between old_idx | new_idx and, in places, months_num; sometimes they chain like a linked list. Not sure what to do with these for now, so they will be ignored temporarily.
I am worried these are unreliable features and artifacts of how the data was retrieved.

In [None]:
#plt.pyplot.boxplot(df_train['mtr_code'], df_train['usage_lev_4'])
df_train.boxplot(column='old_idx', by='fraud', grid=False, showmeans=True, vert = False)
plt.xlim(0, 100000)

In [None]:
#plt.pyplot.boxplot(df_train['mtr_code'], df_train['usage_lev_4'])
df_train.boxplot(column='new_idx', by='fraud', grid=False, showmeans=True, vert = False)
plt.xlim(0, 100000)

In [None]:
df_temp = pd.concat([df_train['new_idx'] - df_train['old_idx'], df_train['fraud']], axis=1)

In [None]:
df_temp

In [None]:
df_temp.boxplot(by = 'fraud')

In [None]:
print(df_train['client_id'].nunique())
print(df_train['invoice_date'].min())
print(df_train['invoice_date'].max())

135.5k Clients between 1977 - 2020.

Might be useful to have a measure of normalised usage.

It seems as though the usage depends more on months_num than on the difference in invoice dates. This is an issue because months_num seems inconsistent.

In [None]:
#df_temp = df_train[(df_train['client_id'] == "train_Client_135089") & (df_train['mtr_id'] == 205552)] # train_Client_123007
df_temp = df_train[(df_train['client_id'] == "train_Client_130431") & (df_train['mtr_id'] == 521409)] 


df_temp = df_temp.sort_values(by='invoice_date')
len(df_temp)

In [None]:
df_temp['usage_n'] = df_temp[['usage_1', 'usage_2', 'usage_3', 'usage_4']].sum(axis=1)
df_temp['usage_N'] = df_temp['usage_n'].cumsum()
df_temp['invoice_date_d'] = df_temp['invoice_date'].diff().fillna(pd.Timedelta(seconds=0))
df_temp['usage_n_daily'] = df_temp['usage_n'] / df_temp['invoice_date_d'].dt.days
df_temp['usage_n_mnum'] = df_temp['usage_n'] / (df_temp['months_num'] * 30)
df_temp['m_num'] = df_temp['months_num']

In [None]:
col_names = ['invoice_date', 'mtr_id', 'mtr_code', 'mtr_tariff', 'old_idx', 'new_idx', 'months_num', 'usage_n', 'usage_N', 'invoice_date_d', 'usage_n_daily', 'usage_n_mnum', 'fraud']
df_temp[col_names]

In [None]:
# Plot the data
plt.figure(figsize=(10, 6))
plt.bar(df_temp[0], df_temp[1], label='Fraud = 1', color = "red")
#plt.bar(df_temp.index, df_temp[0], bottom=df_temp[1], label='Fraud = 0', color = "lightgreen")

plt.legend()
plt.ylim(0, 5000)

In [None]:
#df_temp[['mtr_notes', 'mtr_status', 'fraud']].value_counts().sort_index()

(df_temp[df_temp['fraud'] == 0][['mtr_notes', 'mtr_status']].value_counts().sort_index() +
 df_temp[df_temp['fraud'] == 1][['mtr_notes', 'mtr_status']].value_counts().sort_index())

In [None]:
df_train[['fraud', 'mtr_notes']].value_counts()

In [None]:
df_train.groupby(['client_id', 'mtr_id']).agg({'mtr_coef' : 'nunique', 'fraud' : 'nunique'}).value_counts()

In [None]:
df_train[['client_id', 'mtr_id', 'mtr_coef']][df_train['mtr_coef'] != 1].value_counts().sort_index().head(60)

In [None]:
df_train[['usage_1', 'usage_2', 'usage_3', 'usage_4']].describe()

In [None]:
df_train['months_num']

In [None]:
print(f"Months Number: Min: {df_train['months_num'].min()}, Max: {df_train['months_num'].max()}", end="")

In [None]:
df_train['months_num'].value_counts()

In [None]:
df_train['months_num'][df_train['months_num'] > 12].value_counts()

In [None]:
plt.hist(df_train['months_num'][(df_train['months_num'] > 12) & (df_train['months_num'] < 60)])

In [None]:
# Testing to see if it is related to count of months they've been customer
((df_train['invoice_date'].dt.year - df_train['creation_date'].dt.year) * 12) + (df_train['invoice_date'].dt.month - df_train['creation_date'].dt.month)

In [None]:
df_train['months_num']

It is really unclear what months_num is meant to represent...

It is similarly unclear what "old_idx" and "new_idx" are meant to represent.

In [None]:
df_train.head()

In [None]:
col_names = ['mtr_coef', 'usage_lev_1', 'usage_lev_2', 'usage_lev_3', 'usage_lev_4', 'target', 'tariff_type', 'mtr_status']
df_train[col_names].head()

In [None]:
plt.hist(df_train[['usage_lev_1', 'usage_lev_2', 'usage_lev_3', 'usage_lev_4']])

In [None]:
df_train['mtr_coef'][df_train['mtr_coef'] != 1]

In [None]:
sum(df_train['usage_lev_4'][df_train['mtr_coef'] != 1])

In [None]:
col_names = ['mtr_coef', 'usage_lev_1', 'usage_lev_2', 'usage_lev_3', 'usage_lev_4', 'target', 'tariff_type', 'mtr_status', 'mtr_code', 'counter_type']
df_temp = df_train[col_names]
df_temp[(df_temp['usage_lev_2'] > 0) | (df_temp['usage_lev_3'] > 0) | (df_temp['usage_lev_4'] > 0)]

Trying to get a better sense of what the usage levels mean

In [None]:
df_temp = df_train[(df_train['usage_lev_1'] % 100 == 0)]

In [None]:
temp = df_temp['usage_lev_1'].value_counts()

In [None]:
temp[temp > 3]

In [None]:
df_train['mtr_coef'].value_counts()

In [None]:
#plt.pyplot.boxplot(df_train['mtr_code'], df_train['usage_lev_4'])
df_train[df_train['usage_lev_4'] > 0].boxplot(column='usage_lev_4', by='mtr_code', grid=False, showmeans=True, vert = False)
plt.xlim(0, 100000)

In [None]:
df_train[df_train['usage_lev_3'] > 0].boxplot(column='usage_lev_3', by='mtr_code', grid=False, showmeans=True, vert = False)
plt.xlim(0, 10000)

In [None]:
df_train[df_train['usage_lev_2'] > 0].boxplot(column='usage_lev_2', by='mtr_code', grid=False, showmeans=True, vert = False)
plt.xlim(0, 50000)

In [None]:
df_train[df_train['usage_lev_1'] > 0].boxplot(column='usage_lev_1', by='mtr_code', grid=False, showmeans=True, vert = False)
plt.xlim(0, 50000)

In [None]:
#plt.pyplot.boxplot(df_train['mtr_code'], df_train['usage_lev_4'])
df_train[df_train['usage_lev_4'] > 0].boxplot(column='usage_lev_4', by='tariff_type', grid=False, showmeans=True, vert = False)
plt.xlim(0, 10000)

In [None]:
#plt.pyplot.boxplot(df_train['mtr_code'], df_train['usage_lev_4'])
df_train[df_train['usage_lev_3'] > 0].boxplot(column='usage_lev_3', by='tariff_type', grid=False, showmeans=True, vert = False)
plt.xlim(0, 10000)

In [None]:
#plt.pyplot.boxplot(df_train['mtr_code'], df_train['usage_lev_4'])
df_train[df_train['usage_lev_2'] > 0].boxplot(column='usage_lev_2', by='tariff_type', grid=False, showmeans=True, vert = False)
plt.xlim(0, 10000)

In [None]:
#plt.pyplot.boxplot(df_train['mtr_code'], df_train['usage_lev_4'])
df_train.boxplot(column='usage_lev_1', by='tariff_type', grid=False, showmeans=True, vert = False)
plt.xlim(0, 10000)

In [None]:
df_train.boxplot(column='mtr_code', by='tariff_type', grid=False, showmeans=True, vert = False)

In [None]:
df_train[df_train['usage_lev_4'] > 0].boxplot(column='usage_lev_4', by='client_type', grid=False, showmeans=True, vert = False)
plt.xlim(0, 100000)

In [None]:
df_train[df_train['usage_lev_3'] > 0].boxplot(column='usage_lev_3', by='client_type', grid=False, showmeans=True, vert = False)
plt.xlim(0, 100000)

In [None]:
df_train[df_train['usage_lev_2'] > 0].boxplot(column='usage_lev_2', by='client_type', grid=False, showmeans=True, vert = False)
plt.xlim(0, 100000)

In [None]:
df_train[['tariff_type', 'client_type']].value_counts()

In [None]:
df_train[df_train['usage_lev_1'] > 0].boxplot(column='usage_lev_1', by='client_type', grid=False, showmeans=True, vert = False)
plt.xlim(0, 100000)

In [None]:
df_train[df_train['tariff_type'] == 10][['tariff_type', 'mtr_code']].value_counts()

In [None]:
#df_train.boxplot(column='invoice_date', by='target', grid=False, showmeans=True, vert = False)
#pd.to_timedelta(df_train['invoice_date']).dt.total_seconds()
df_temp = df_train[['invoice_date', 'target']].copy()  # copy so the assignment below does not write into a view of df_train
df_temp['invoice_date'] = (df_temp['invoice_date'] - pd.to_datetime('1980', format='%Y')).dt.total_seconds()
df_temp.boxplot(column='invoice_date', by='target', grid=False, showmeans=True, vert = False)

In [None]:
df_train.groupby('target').agg({'invoice_date': ['min', 'max']})

In [None]:
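#df_train.groupby('target').agg({'client_id': ['nunique', 'count']})

# Inner-join the client_ids of fraud rows against those of non-fraud rows;
# any row in the result means a client appears under both targets
pd.merge(df_train['client_id'][df_train['target'] == 1], df_train['client_id'][df_train['target'] == 0], how = 'inner')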

In [None]:
(df_train.groupby('client_id').agg({'target': ['nunique']}) > 1).sum()

In [None]:
(df_train.groupby('client_id').agg({'creation_date': ['nunique']}) > 1).sum()

In [None]:
(df_train.groupby('mtr_num').agg({'client_id': ['nunique']}) > 1).sum()

In [None]:
df_train[['invoice_date', 'creation_date', 'mtr_num','usage_lev_1', 'usage_lev_2', 'usage_lev_3', 'usage_lev_4']][(df_train['client_id'] == 'train_Client_76644') & (df_train['counter_type'] == 'ELEC')].sort_values('invoice_date')

In [None]:
df_train[['invoice_date', 'creation_date', 'mtr_num','usage_lev_1', 'usage_lev_2', 'usage_lev_3', 'usage_lev_4', 'tariff_type', 'mtr_coef', 'mtr_code', 'reading_remarks']][(df_train['client_id'] == 'train_Client_76644') & (df_train['counter_type'] == 'ELEC') & (df_train['mtr_num'] == 68778)].sort_values('invoice_date').tail(60)

In [None]:
df_train[df_train['usage_lev_4'] > 0].groupby(['target']).agg({'usage_lev_4': ['count']})

In [None]:
df_train[df_train['usage_lev_4'] == 0].groupby(['target']).agg({'usage_lev_4': ['count']})

In [None]:
print(78519/4045118)
print(14439/338673)

In [None]:
df_train[df_train['usage_lev_1'] > 0].groupby(['target']).agg({'usage_lev_1': ['count']})

In [None]:
df_train[df_train['usage_lev_1'] == 0].groupby(['target']).agg({'usage_lev_1': ['count']})

In [None]:
print(3693828/429809)
print(315368/37744)

In [None]:
df_train[df_train['usage_lev_2'] > 0].groupby(['target']).agg({'usage_lev_2': ['count']})

In [None]:
df_train[df_train['usage_lev_2'] == 0].groupby(['target']).agg({'usage_lev_2': ['count']})

In [None]:
print(593658/3529979)
print(66912/286200)

In [None]:
df_train[df_train['usage_lev_3'] > 0].groupby(['target']).agg({'usage_lev_3': ['count']})

In [None]:
df_train[df_train['usage_lev_3'] == 0].groupby(['target']).agg({'usage_lev_3': ['count']})

In [None]:
print(159820/3963817)
print(23538/329574)
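
The ratios printed above are typed in by hand from the preceding group-by outputs. As a hedged alternative, the sketch below computes the same quantity (rows with non-zero usage divided by rows with zero usage, per target class) directly from a frame. The toy data and the helper name `nonzero_to_zero_ratio` are invented for illustration; they are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration only.
toy = pd.DataFrame({
    'target':      [0, 0, 0, 0, 1, 1],
    'usage_lev_1': [100, 0, 250, 30, 0, 400],
    'usage_lev_4': [0, 0, 10, 0, 5, 0],
})

def nonzero_to_zero_ratio(df, col):
    """Per target class: (# rows with col > 0) / (# rows with col == 0)."""
    nonzero = (df[col] > 0).groupby(df['target']).sum()
    zero = (df[col] == 0).groupby(df['target']).sum()
    return nonzero / zero

# One column of ratios per usage level, indexed by target.
ratios = pd.DataFrame({
    col: nonzero_to_zero_ratio(toy, col)
    for col in ['usage_lev_1', 'usage_lev_4']
})
print(ratios)
```

Applied to df_train with all four usage_lev columns, this would produce every ratio in one table and avoid transcription errors when the data changes.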

In [None]:
df_train[df_train['mtr_coef'] != 1].groupby(['target']).agg({'mtr_coef': lambda x: x.astype(int).median()})


In [None]:
df_train.boxplot(column='mtr_coef', by='target', grid=False, showmeans=True, vert = False)
plt.xlim(0, 20)

In [None]:
df_train.groupby('target').agg({'mtr_coef' : 'value_counts'})

In [None]:
# Group by client_id and aggregate unique mtr_code values
unique_mtr_by_client = df_train.groupby('client_id')['mtr_code'].unique()

# Convert to a flat structure with client_id and unique mtr_code values
client_mtr_flat = unique_mtr_by_client.explode().reset_index()

# Add target information back to the flattened data
client_mtr_flat = client_mtr_flat.merge(df_train[['client_id', 'target']].drop_duplicates(), on='client_id')

# Group by target and count unique mtr_code occurrences
result = client_mtr_flat.groupby('target')['mtr_code'].value_counts().reset_index(name='count')

In [None]:
print(result.sort_values(['target', 'mtr_code']))

In [None]:
temp = result.sort_values(['target', 'mtr_code'])
temp = temp.pivot(index='mtr_code', columns='target', values='count').reset_index()
# Rename columns for clarity
temp.columns = ['mtr_code', 'target_0', 'target_1']

# Fill missing values with 0 (in case some mtr_codes do not have counts for one of the targets)
temp = temp.fillna(0)
temp

In [None]:
df_train['region'].unique()

In [None]:
df_train.boxplot(column='target', by='region', grid=False, showmeans=True, vert = False)
#plt.pyplot.xlim(0, 20)

In [None]:
df_train[['district', 'region']].value_counts().sort_index()

In [None]:
(df_train.groupby('client_id').agg({'                                                                                                                                                                                                                                                                                                                                      ': 'nunique'}) >1).sum()

In [None]:
# Reuse the per-mtr_code client counts from the pivoted `temp` table above
temp = {
    'mtr_code':       temp['mtr_code'],
    'target_0_count': temp['target_0'],
    'target_1_count': temp['target_1']
}

df_temp = pd.DataFrame(temp)
df_temp['target_0_count'] = df_temp['target_0_count'].astype('int')
df_temp['target_1_count'] = df_temp['target_1_count'].astype('int')

# Calculate total population
df_temp['obs_y'] = df_temp['target_0_count'] + df_temp['target_1_count']

# Total number of fraud cases
temp = df_temp['target_1_count'].sum()

# Observed fraud distribution
df_temp['obs_y_p'] = df_temp['target_1_count'] / temp
df_temp['obs_y_n'] = df_temp['target_1_count']

# Simulate random subsets from total population
n_bootstraps = 1000
random_proportions = []
random_counts = []
for _ in range(n_bootstraps):
    # Randomly draw mtr_code values with probability proportional to each code's population
    sampled = np.random.choice(df_temp['mtr_code'], size=temp, p=df_temp['obs_y'] / df_temp['obs_y'].sum())
    # Calculate proportions for each mtr_code in the sample
    proportions = [np.sum(sampled == coef) / temp for coef in df_temp['mtr_code']]
    random_proportions.append(proportions)
    counts = [np.sum(sampled == coef) for coef in df_temp['mtr_code']]
    random_counts.append(counts)

# Convert to DataFrame for analysis
random_proportions_df = pd.DataFrame(random_proportions, columns=df_temp['mtr_code'])
random_counts_df = pd.DataFrame(random_counts, columns=df_temp['mtr_code'])

# Calculate confidence intervals for random proportions
ci_lower_proportions = random_proportions_df.quantile(0.025)
ci_upper_proportions = random_proportions_df.quantile(0.975)
ci_lower_counts = random_counts_df.quantile(0.025)
ci_upper_counts = random_counts_df.quantile(0.975)

# Add results to the original DataFrame
df_temp['ci_lo_p'] = ci_lower_proportions.values
df_temp['ci_up_p'] = ci_upper_proportions.values
df_temp['ci_lo_n'] = ci_lower_counts.values
df_temp['ci_up_n'] = ci_upper_counts.values

# Compare observed proportions to confidence intervals
df_temp['sig_p'] = (df_temp['obs_y_p'] < df_temp['ci_lo_p']) | (df_temp['obs_y_p'] > df_temp['ci_up_p'])
df_temp['sig_n'] = (df_temp['target_1_count'] < df_temp['ci_lo_n']) | (df_temp['target_1_count'] > df_temp['ci_up_n'])

# Display results
print(df_temp[['mtr_code', 'obs_y_n', 'obs_y_p', 'ci_lo_p', 'ci_up_p', 'sig_p', 'ci_lo_n', 'ci_up_n', 'sig_n']])


In [None]:
print((df_train['mtr_num'] < 20000000000000).sum())
print((df_train['mtr_num'] >  5000000000000).sum())
print((df_train['mtr_num'] > 10000000000000).sum())
print((df_train['mtr_num'] > 20000000000000).sum())
print((df_train['mtr_num'] > 25000000000000).sum())

In [None]:
df_train['mtr_num'][df_train['target'] == 1].describe()

In [None]:
df_train['mtr_num'][df_train['target'] == 0].describe()

In [None]:
print(df_temp[['mtr_code', 'obs_y_n', 'ci_lo_n', 'ci_up_n', 'sig_n']])

In [None]:
df_train['mtr_num'].nunique()

In [None]:
# Group by client_id and aggregate unique mtr_coef values
unique_mtr_by_client = df_train.groupby('client_id')['mtr_coef'].unique()

# Convert to a flat structure with client_id and unique mtr_coef values
client_mtr_flat = unique_mtr_by_client.explode().reset_index()

# Add target information back to the flattened data
client_mtr_flat = client_mtr_flat.merge(df_train[['client_id', 'target']].drop_duplicates(), on='client_id')

# Group by target and count unique mtr_coef occurrences
result = client_mtr_flat.groupby('target')['mtr_coef'].value_counts().reset_index(name='count')

In [None]:
print(result.sort_values(['target', 'mtr_coef']))

In [None]:
result.groupby('target').agg({'count' : 'sum'})

In [None]:
# Hard-coded client counts per mtr_coef value, split by target
temp = {
    'mtr_coef':       [  0,      1,  2,  3, 4, 5, 6, 8, 9, 10, 11, 20, 30, 33, 40, 50],
    'target_0_count': [ 24, 127893, 21, 10, 3, 0, 5, 1, 2,  4,  1,  1,  1,  1,  2,  1],
    'target_1_count': [  1,   7566,  0,  0, 0, 1, 0, 0, 0,  1,  0,  0,  0,  0,  0,  0]
}

df_temp = pd.DataFrame(temp)

# Calculate total population
df_temp['obs_y'] = df_temp['target_0_count'] + df_temp['target_1_count']

# Total number of fraud cases
temp = df_temp['target_1_count'].sum()

# Observed fraud distribution
df_temp['obs_y_p'] = df_temp['target_1_count'] / temp
df_temp['obs_y_n'] = df_temp['target_1_count']

# Simulate random subsets from total population
n_bootstraps = 10000
random_proportions = []
random_counts = []
for _ in range(n_bootstraps):
    # Randomly sample indices proportional to the population
    sampled = np.random.choice(df_temp['mtr_coef'], size=temp, p=df_temp['obs_y'] / df_temp['obs_y'].sum())
    # Calculate proportions for each mtr_coef in the sample
    proportions = [np.sum(sampled == coef) / temp for coef in df_temp['mtr_coef']]
    random_proportions.append(proportions)
    counts = [np.sum(sampled == coef) for coef in df_temp['mtr_coef']]
    random_counts.append(counts)

# Convert to DataFrame for analysis
random_proportions_df = pd.DataFrame(random_proportions, columns=df_temp['mtr_coef'])
random_counts_df = pd.DataFrame(random_counts, columns=df_temp['mtr_coef'])

# Calculate confidence intervals for random proportions
ci_lower_proportions = random_proportions_df.quantile(0.025)
ci_upper_proportions = random_proportions_df.quantile(0.975)
ci_lower_counts = random_counts_df.quantile(0.025)
ci_upper_counts = random_counts_df.quantile(0.975)

# Add results to the original DataFrame
df_temp['ci_lo_p'] = ci_lower_proportions.values
df_temp['ci_up_p'] = ci_upper_proportions.values
df_temp['ci_lo_n'] = ci_lower_counts.values
df_temp['ci_up_n'] = ci_upper_counts.values

# Compare observed proportions to confidence intervals
df_temp['sig_p'] = (df_temp['obs_y_p'] < df_temp['ci_lo_p']) | (df_temp['obs_y_p'] > df_temp['ci_up_p'])
df_temp['sig_n'] = (df_temp['target_1_count'] < df_temp['ci_lo_n']) | (df_temp['target_1_count'] > df_temp['ci_up_n'])

# Display results
print(df_temp[['mtr_coef', 'obs_y_n', 'obs_y_p', 'ci_lo_p', 'ci_up_p', 'sig_p', 'ci_lo_n', 'ci_up_n', 'sig_n']])
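
The loop above draws as many labels as there are fraud cases, with replacement, using probabilities proportional to each category's population, and then counts the labels per category. That is one multinomial draw per iteration, so the whole simulation can be sketched as a single vectorized call. The block below reuses the hard-coded mtr_coef counts from the cell above; the seed and the array names are arbitrary choices for this illustration, not part of the original analysis.

```python
import numpy as np

# Counts copied from the hard-coded mtr_coef table in the cell above.
mtr_coef = np.array([0, 1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 20, 30, 33, 40, 50])
target_0 = np.array([24, 127893, 21, 10, 3, 0, 5, 1, 2, 4, 1, 1, 1, 1, 2, 1])
target_1 = np.array([1, 7566, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0])

population = target_0 + target_1
n_fraud = int(target_1.sum())
p = population / population.sum()

# Seed picked arbitrarily so the sketch is reproducible.
rng = np.random.default_rng(0)

# Each row is one simulated allocation of n_fraud cases across the categories.
sim_counts = rng.multinomial(n_fraud, p, size=10_000)

# 95% interval of the simulated counts per category.
ci_lo = np.quantile(sim_counts, 0.025, axis=0)
ci_hi = np.quantile(sim_counts, 0.975, axis=0)
outside = (target_1 < ci_lo) | (target_1 > ci_hi)

for coef, obs, lo, hi, flag in zip(mtr_coef, target_1, ci_lo, ci_hi, outside):
    print(f"mtr_coef={coef:>2}  observed={obs:>4}  interval=[{lo:.0f}, {hi:.0f}]  outside={flag}")
```

Like the original loop, this samples with replacement. With categories as small as one or two clients, the intervals are very coarse, so any flag on those rows deserves caution.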


In [None]:
pd.set_option('display.float_format', '{:.6f}'.format)
#pd.reset_option('display.float_format')

In [None]:
(df_train[df_train['target'] == 0]['new_idx'] - df_train[df_train['target'] == 0]['old_idx']).describe()

In [None]:
(df_train[df_train['target'] == 1]['new_idx'] - df_train[df_train['target'] == 1]['old_idx']).describe()

In [None]:
print(df_temp[['mtr_coef', 'obs_y_n', 'ci_lo_n', 'ci_up_n', 'sig_n']])

In [None]:
(df_train['target'] == 1).sum() / len(df_train)

In [None]:
(df_train['mtr_coef'] == 10).sum()

In [None]:
df_train['mtr_num'].nunique()

In [None]:
df_train['client_id'].nunique()

There is no clear link between meter coefficient and usage level.

In [None]:
df_temp[(df_temp['usage_lev_3'] > 0) | (df_temp['usage_lev_4'] > 0)].head(60)

When L3 and L4 are present, L2 seems to be (L1 / 2).

In [None]:
mask = (df_temp['usage_lev_3'] > 0)
df_temp['usage_lev_1'][mask][(df_temp['usage_lev_1'][mask] - (df_temp['usage_lev_2'][mask] * 2)) != 0].unique()

Looking at rows where L3 > 0, that rule is only broken when L1 is 5 or 1.

In [None]:
mask = (df_temp['usage_lev_4'] > 0)
df_temp[mask].head(60)

The levels seem incremental. I was thinking they could have been paired (for gas and electricity, etc.). It seems that L3 is twice L2 and the same as L1.

It looks like the usage levels spill over: L2 is half of L1, L3 is equal to L1, and L4 is then uncapped. Although this is broadly true, there are some unexpected L1 values where the pattern breaks (L1 == 5 or L1 == 1).
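
To make the spill-over hypothesis concrete, here is a minimal sketch of how a total consumption would be split if the levels really fill up in order, with L1 capped at some value, L2 at half of it, L3 at the same value as L1, and L4 taking the remainder. The function name `split_into_levels` and the cap of 200 are hypothetical; the dataset does not state the actual caps, and the exceptions noted above are not modelled.

```python
def split_into_levels(total, cap):
    """Split a total consumption into four levels under the assumed spill-over rule.

    Assumed caps: L1 = cap, L2 = cap // 2, L3 = cap, L4 = uncapped.
    """
    caps = [cap, cap // 2, cap]
    levels = []
    remaining = total
    for c in caps:
        used = min(remaining, c)   # fill this level up to its cap
        levels.append(used)
        remaining -= used          # the rest spills into the next level
    levels.append(remaining)       # whatever is left goes to L4
    return levels

print(split_into_levels(150, 200))  # [150, 0, 0, 0]
print(split_into_levels(250, 200))  # [200, 50, 0, 0]
print(split_into_levels(700, 200))  # [200, 100, 200, 200]
```

If the hypothesis holds, rows with L3 > 0 should show L2 equal to half of L1, and rows with L4 > 0 should show L3 equal to L1, which is the pattern observed above.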

In [None]:
# Ignore temporarily usage_lev_1 == 1 or 5 since they seem weird
mask = (df_temp['usage_lev_1'] != 1) & (df_temp['usage_lev_1'] != 5)
df_temp[mask][(df_temp['usage_lev_1'][mask] < df_temp['usage_lev_3'][mask])]

In [None]:
# Ignore temporarily usage_lev_1 == 1 or 5 since they seem weird
mask = (df_temp['usage_lev_1'] != 1) & (df_temp['usage_lev_1'] != 5)
df_temp[mask][(df_temp['usage_lev_1'][mask] < (df_temp['usage_lev_2'][mask] * 2))]

In [None]:
print(df_temp['usage_lev_3'][mask][(df_temp['usage_lev_1'][mask] < (df_temp['usage_lev_2'][mask] * 2))].unique())
print(df_temp['usage_lev_4'][mask][(df_temp['usage_lev_1'][mask] < (df_temp['usage_lev_2'][mask] * 2))].unique())

The rule that L1 and L3 cap at the same value is very rarely broken, and breaking it does not seem to be associated with fraud.

Cases where 2 * L2 exceeds L1 happen too frequently to ignore. But they only occur when consumption is not "spilling" into L3 and L4.
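
A hedged sketch of how these two observations could be checked in one pass: flag rows where L1 < 2 * L2, flag rows that spill into L3 or L4, then cross-tabulate the flags and compare fraud rates. The toy data and the flag column names (`l2_over_half_l1`, `spills`) are invented for illustration.

```python
import pandas as pd

# Toy stand-in for df_train; the values are made up for illustration only.
toy = pd.DataFrame({
    'usage_lev_1': [200, 200, 100, 300, 200, 50],
    'usage_lev_2': [100, 150, 80, 150, 100, 0],
    'usage_lev_3': [200, 0, 0, 300, 0, 0],
    'usage_lev_4': [40, 0, 0, 0, 0, 0],
    'target':      [0, 1, 0, 0, 0, 1],
})

# Rows where L2 is larger than half of L1 (same condition as the cells above).
toy['l2_over_half_l1'] = toy['usage_lev_1'] < toy['usage_lev_2'] * 2
# Rows where consumption reaches L3 or L4.
toy['spills'] = (toy['usage_lev_3'] > 0) | (toy['usage_lev_4'] > 0)

# How the two flags co-occur.
print(pd.crosstab(toy['l2_over_half_l1'], toy['spills']))

n_violating_and_spilling = int((toy['l2_over_half_l1'] & toy['spills']).sum())
fraud_rate_violating = toy.loc[toy['l2_over_half_l1'], 'target'].mean()
fraud_rate_other = toy.loc[~toy['l2_over_half_l1'], 'target'].mean()
print(n_violating_and_spilling, fraud_rate_violating, fraud_rate_other)
```

On the real data, a zero in the cell where both flags are true would support the statement that the rule is only broken when there is no spill-over.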

In [None]:
test = df_train['tariff_type'].astype("category")
#test.cat.categories
#pd.Categorical.from_codes(splitter, categories=tariff_codes)

Potentially, there is a meaningful scale here between 0 and 5, but it would be hard to know. Going to treat it as categorical.
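
A minimal sketch of what treating such a code as categorical could look like in pandas: convert the column to the `category` dtype so no ordering is implied, then one-hot encode it for models that need numeric input. The column name `level_code` and its values are hypothetical stand-ins, not a column from the dataset.

```python
import pandas as pd

# Hypothetical integer-coded column with values somewhere in 0..5.
toy = pd.DataFrame({'level_code': [0, 1, 0, 5, 1]})

# Unordered categorical: the codes become labels rather than magnitudes.
toy['level_code'] = toy['level_code'].astype('category')
print(toy['level_code'].cat.categories)

# One indicator column per observed category.
dummies = pd.get_dummies(toy['level_code'], prefix='level')
print(dummies)
```

If the scale later turns out to be meaningful, `pd.Categorical(..., ordered=True)` would keep the same labels while recording their order.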

In [None]:
print(df_train['counter_number'].unique())

In [None]:
# client_id: object -> factor
# counter_statue: why is that not an int64?
# counter_type: why is that not an int64?
# target: object -> factor


In [None]:
df_train['counter_statue'].unique()

In [None]:
df_train['counter_type'].unique()

In [None]:
(df_train.groupby(['client_id', 'mtr_code']).agg({'reading_remarks' : 'nunique'}) > 1).sum()

## Importing Dataset (Optional)

This is only for downloading the data from Kaggle. You can obtain it however you wish; just keep the .csv files under their original names in the root folder of the repository. For the method below, you need an API key saved (on Windows: C:\Users\<Windows-username>\.kaggle\kaggle.json); please refer to the [relevant Kaggle link](https://www.kaggle.com/docs/api#interacting-with-datasets) for more info.

In [None]:
import kaggle, zipfile
!kaggle datasets download mrmorj/fraud-detection-in-electricity-and-gas-consumption

In [None]:
# Unzip the data files (~480MB total).
f_path = './fraud-detection-in-electricity-and-gas-consumption.zip'

with zipfile.ZipFile(f_path, 'r') as zip_file:
    zip_file.extractall('./')