# Fraud Detection in Electricity and Gas Consumption Challenge

This is a simple starter notebook based on the tutorial prepared by Joy Wawira. Check out the article [here](https://zindi.africa/learn/fraud-detection-in-electricity-and-gas-consumption-challenge-tutorial) for a more detailed description of the steps taken.

This notebook covers:
- Downloading the data straight from Zindi onto Colab
- Loading the data and carrying out simple EDA to understand the data and prepare for modelling
- Preprocessing the data and feature engineering
- Creating a simple LGBM model and predicting on the test set
- Preparing the submission file and saving it as CSV
- Some tips on how to improve model performance and your score

**Column explanations**


About
The data provided by STEG consists of two files. The first contains client data and the second contains billing history from 2005 to 2019.

There are two .zip files for download, train.zip and test.zip, plus a SampleSubmission.csv. Each .zip file contains a client file and an invoice file.

Variable definitions

Client:

* Client_id: Unique id for client
* District: District where the client is
* Client_catg: Category client belongs to
* Region: Area where the client is
* Creation_date: Date client joined
* Target: 1 = fraud, 0 = not fraud

Invoice data

* Client_id: Unique id for the client
* Invoice_date: Date of the invoice
* Tarif_type: Type of tax
* Counter_number:
* Counter_statue: Takes up to 5 values such as working fine, not working, on hold status, etc.
* Counter_code:
* Reading_remarque: Notes that the STEG agent takes during a visit to the client (e.g., if the counter shows something wrong, the agent gives a bad score)
* Counter_coefficient: An additional coefficient to be added when standard consumption is exceeded
* Consommation_level_1: Consumption_level_1
* Consommation_level_2: Consumption_level_2
* Consommation_level_3: Consumption_level_3
* Consommation_level_4: Consumption_level_4
* Old_index: Old index
* New_index: New index
* Months_number: Month number
* Counter_type: Type of counter
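
The invoice fields relate to each other in a way that is easy to check: in the sample rows shown later in this notebook, the difference between `new_index` and `old_index` equals the sum of the four `consommation_level_*` columns. Below is a minimal sketch of that consistency check, using three rows copied from the `invoice_train.csv` preview. The column names `total_consumption` and `index_gap` are illustrative assumptions, and whether the relation holds for every invoice in the full data has not been verified here.

```python
import pandas as pd

# Three rows copied from the invoice_train.csv preview in this notebook
sample = pd.DataFrame({
    "old_index": [14302, 12294, 14624],
    "new_index": [14384, 13678, 14747],
    "consommation_level_1": [82, 1200, 123],
    "consommation_level_2": [0, 184, 0],
    "consommation_level_3": [0, 0, 0],
    "consommation_level_4": [0, 0, 0],
})

level_cols = [f"consommation_level_{i}" for i in range(1, 5)]

# Total billed consumption across the four levels
sample["total_consumption"] = sample[level_cols].sum(axis=1)

# Gap between the meter reading difference and the billed consumption
# (hypothetical feature name; 0 means the invoice is internally consistent)
sample["index_gap"] = (
    sample["new_index"] - sample["old_index"] - sample["total_consumption"]
)

print(sample[["total_consumption", "index_gap"]])
```

A non-zero `index_gap` on real invoices could be worth exploring as a candidate feature, though that is a suggestion rather than something the challenge description states.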

In [None]:
# Import libraries

In [10]:
import warnings

warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

from imblearn.over_sampling import SMOTE

# Preprocessing
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, OneHotEncoder, OrdinalEncoder
from src.preprocessing.agg_invoice import agg_invoice_num_mode_no_monthly_weighting
from src.preprocessing.cleaning import convert_dtype_cat_date



from sklearn.svm import SVC
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, roc_auc_score, roc_curve, confusion_matrix, classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

from matplotlib.ticker import PercentFormatter
plt.rcParams.update({ "figure.figsize" : (8, 5),"axes.facecolor" : "white", "axes.edgecolor":  "black"})
plt.rcParams["figure.facecolor"]= "w"
pd.plotting.register_matplotlib_converters()
pd.set_option('display.float_format', lambda x: '%.3f' % x)


# Define a plotting style to be used for all plots in this notebook
plt.style.use('tableau-colorblind10')

# from sqlalchemy import create_engine
# from dotenv import load_dotenv

ImportError: cannot import name 'convert_dtype_cat_date' from 'src.preprocessing.cleaning' (/Users/neuefische/repos/ml_project_mariusbosch/gas-fraud-team-crime/src/preprocessing/cleaning.py)

In [2]:
df_client_train = pd.read_csv('data/train/client_train.csv')
df_invoice_train = pd.read_csv('data/train/invoice_train.csv')

In [5]:
display(df_invoice_train.info(),df_invoice_train.describe(),df_invoice_train.shape,df_invoice_train.head(15))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4476749 entries, 0 to 4476748
Data columns (total 16 columns):
 #   Column                Dtype 
---  ------                ----- 
 0   client_id             object
 1   invoice_date          object
 2   tarif_type            int64 
 3   counter_number        int64 
 4   counter_statue        object
 5   counter_code          int64 
 6   reading_remarque      int64 
 7   counter_coefficient   int64 
 8   consommation_level_1  int64 
 9   consommation_level_2  int64 
 10  consommation_level_3  int64 
 11  consommation_level_4  int64 
 12  old_index             int64 
 13  new_index             int64 
 14  months_number         int64 
 15  counter_type          object
dtypes: int64(12), object(4)
memory usage: 546.5+ MB


None

| | tarif_type | counter_number | counter_code | reading_remarque | counter_coefficient | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | old_index | new_index | months_number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 | 4476749.0 |
| mean | 20.128 | 123058699065.183 | 172.488 | 7.322 | 1.003 | 410.979 | 109.322 | 20.306 | 52.926 | 17766.998 | 18349.705 | 44.831 |
| std | 13.473 | 1657267274261.927 | 133.887 | 1.572 | 0.308 | 757.308 | 1220.123 | 157.424 | 875.472 | 40366.926 | 40953.206 | 3128.335 |
| min | 8.0 | 0.0 | 0.0 | 5.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 11.0 | 121108.0 | 5.0 | 6.0 | 1.0 | 79.0 | 0.0 | 0.0 | 0.0 | 1791.0 | 2056.0 | 4.0 |
| 50% | 11.0 | 494561.0 | 203.0 | 8.0 | 1.0 | 274.0 | 0.0 | 0.0 | 0.0 | 7690.0 | 8192.0 | 4.0 |
| 75% | 40.0 | 1115161.0 | 207.0 | 9.0 | 1.0 | 600.0 | 0.0 | 0.0 | 0.0 | 21660.0 | 22343.0 | 4.0 |
| max | 45.0 | 27981145458733.0 | 600.0 | 413.0 | 50.0 | 999910.0 | 999073.0 | 64492.0 | 547946.0 | 2800280.0 | 2870972.0 | 636624.0 |


(4476749, 16)

| | client_id | invoice_date | tarif_type | counter_number | counter_statue | counter_code | reading_remarque | counter_coefficient | consommation_level_1 | consommation_level_2 | consommation_level_3 | consommation_level_4 | old_index | new_index | months_number | counter_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train_Client_0 | 2014-03-24 | 11 | 1335667 | 0 | 203 | 8 | 1 | 82 | 0 | 0 | 0 | 14302 | 14384 | 4 | ELEC |
| 1 | train_Client_0 | 2013-03-29 | 11 | 1335667 | 0 | 203 | 6 | 1 | 1200 | 184 | 0 | 0 | 12294 | 13678 | 4 | ELEC |
| 2 | train_Client_0 | 2015-03-23 | 11 | 1335667 | 0 | 203 | 8 | 1 | 123 | 0 | 0 | 0 | 14624 | 14747 | 4 | ELEC |
| 3 | train_Client_0 | 2015-07-13 | 11 | 1335667 | 0 | 207 | 8 | 1 | 102 | 0 | 0 | 0 | 14747 | 14849 | 4 | ELEC |
| 4 | train_Client_0 | 2016-11-17 | 11 | 1335667 | 0 | 207 | 9 | 1 | 572 | 0 | 0 | 0 | 15066 | 15638 | 12 | ELEC |
| 5 | train_Client_0 | 2017-07-17 | 11 | 1335667 | 0 | 207 | 9 | 1 | 314 | 0 | 0 | 0 | 15638 | 15952 | 8 | ELEC |
| 6 | train_Client_0 | 2018-12-07 | 11 | 1335667 | 0 | 207 | 9 | 1 | 541 | 0 | 0 | 0 | 15952 | 16493 | 12 | ELEC |
| 7 | train_Client_0 | 2019-03-19 | 11 | 1335667 | 0 | 207 | 9 | 1 | 585 | 0 | 0 | 0 | 16493 | 17078 | 8 | ELEC |
| 8 | train_Client_0 | 2011-07-22 | 11 | 1335667 | 0 | 203 | 9 | 1 | 1200 | 186 | 0 | 0 | 7770 | 9156 | 4 | ELEC |
| 9 | train_Client_0 | 2011-11-22 | 11 | 1335667 | 0 | 203 | 6 | 1 | 1082 | 0 | 0 | 0 | 9156 | 10238 | 4 | ELEC |


In [8]:
df_invoice_train['months_number'].unique()

array([     4,     12,      8, ..., 367445, 459733, 463554])
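
The values above show that `months_number` contains entries such as 367445 and 463554, which cannot be a number of months. Below is a minimal sketch for quantifying such entries. The cutoff of 12 months is an assumption chosen for illustration, not a rule from the data description, and the small series stands in for `df_invoice_train['months_number']`.

```python
import pandas as pd

# Stand-in for df_invoice_train["months_number"], using values seen in this notebook
months = pd.Series([4, 12, 8, 4, 367445, 459733, 463554, 4], name="months_number")

MAX_PLAUSIBLE_MONTHS = 12  # assumed cutoff; adjust after inspecting the real data

# Boolean mask of entries above the assumed cutoff
implausible = months > MAX_PLAUSIBLE_MONTHS

print(f"{implausible.sum()} of {len(months)} values exceed {MAX_PLAUSIBLE_MONTHS}")
print(f"share: {implausible.mean():.3f}")
```

On the real column, the same mask could be used to decide whether to clip, drop, or flag those rows before aggregating per client.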

In [9]:
display(df_client_train.info(),df_client_train.describe(),df_client_train.shape,df_client_train.head(15))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 135493 entries, 0 to 135492
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   disrict        135493 non-null  int64  
 1   client_id      135493 non-null  object 
 2   client_catg    135493 non-null  int64  
 3   region         135493 non-null  int64  
 4   creation_date  135493 non-null  object 
 5   target         135493 non-null  float64
dtypes: float64(1), int64(3), object(2)
memory usage: 6.2+ MB


None

| | disrict | client_catg | region | target |
|---|---|---|---|---|
| count | 135493.0 | 135493.0 | 135493.0 | 135493.0 |
| mean | 63.511 | 11.513 | 206.16 | 0.056 |
| std | 3.354 | 4.424 | 104.207 | 0.23 |
| min | 60.0 | 11.0 | 101.0 | 0.0 |
| 25% | 62.0 | 11.0 | 103.0 | 0.0 |
| 50% | 62.0 | 11.0 | 107.0 | 0.0 |
| 75% | 69.0 | 11.0 | 307.0 | 0.0 |
| max | 69.0 | 51.0 | 399.0 | 1.0 |


(135493, 6)
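
The summary above reports a mean `target` of 0.056, so roughly 5.6% of the clients are labelled as fraud and the classes are strongly imbalanced. Below is a sketch of how the class counts and shares could be inspected directly. The toy series is a stand-in for `df_client_train['target']`, not real data.

```python
import pandas as pd

# Stand-in for df_client_train["target"]: 1 fraud case among 20 clients
target = pd.Series([0.0] * 19 + [1.0], name="target")

counts = target.value_counts()                # absolute number per class
shares = target.value_counts(normalize=True)  # relative share per class

print(counts)
print(shares.round(3))
```

With such an imbalance, accuracy alone says little; the imported `f1_score`, `recall_score`, and `roc_auc_score` are likely to be more informative.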

| | disrict | client_id | client_catg | region | creation_date | target |
|---|---|---|---|---|---|---|
| 0 | 60 | train_Client_0 | 11 | 101 | 31/12/1994 | 0.0 |
| 1 | 69 | train_Client_1 | 11 | 107 | 29/05/2002 | 0.0 |
| 2 | 62 | train_Client_10 | 11 | 301 | 13/03/1986 | 0.0 |
| 3 | 69 | train_Client_100 | 11 | 105 | 11/07/1996 | 0.0 |
| 4 | 62 | train_Client_1000 | 11 | 303 | 14/10/2014 | 0.0 |
| 5 | 69 | train_Client_10000 | 11 | 103 | 29/09/1993 | 0.0 |
| 6 | 62 | train_Client_100000 | 11 | 309 | 07/06/2012 | 0.0 |
| 7 | 60 | train_Client_100001 | 11 | 101 | 12/04/2006 | 0.0 |
| 8 | 62 | train_Client_100002 | 11 | 301 | 31/12/2006 | 0.0 |
| 9 | 60 | train_Client_100003 | 11 | 101 | 25/10/2011 | 0.0 |
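
The preview shows `creation_date` stored as text in day/month/year order (for example `31/12/1994`), while `invoice_date` uses `YYYY-MM-DD`. Below is a sketch of explicit parsing with `pd.to_datetime`. The values are copied from the preview, and the derived column `creation_year` is an illustrative assumption. What the project's own `convert_dtype_cat_date` helper does is not shown in this notebook.

```python
import pandas as pd

# Three rows copied from the client_train.csv preview
clients = pd.DataFrame({
    "client_id": ["train_Client_0", "train_Client_1", "train_Client_10"],
    "creation_date": ["31/12/1994", "29/05/2002", "13/03/1986"],
})

# An explicit format avoids day/month ambiguity for dates such as 07/06/2012
clients["creation_date"] = pd.to_datetime(clients["creation_date"], format="%d/%m/%Y")

# Hypothetical derived feature: the year the client joined
clients["creation_year"] = clients["creation_date"].dt.year

print(clients.dtypes)
print(clients)
```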


In [None]:
# Feature engineering: aggregate the invoice data with the project helper

df_invoice= agg_invoice_num_mode_no_monthly_weighting(df_invoice_train)
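
The helper `agg_invoice_num_mode_no_monthly_weighting` lives in the project's `src` package and its body is not shown here. Judging by its name, it reduces the invoice table to one row per client using numeric aggregates and modes. Below is a rough sketch of that kind of aggregation. It is an assumption about the approach, not the helper's actual implementation; the function name `aggregate_invoices`, the chosen columns, and the output column names are hypothetical, and the rows are toy data.

```python
import pandas as pd

def aggregate_invoices(invoices: pd.DataFrame) -> pd.DataFrame:
    """Collapse invoices to one row per client (illustrative only)."""
    grouped = invoices.groupby("client_id")

    # Numeric columns: mean per client
    numeric = grouped[["consommation_level_1", "months_number"]].mean()
    numeric.columns = [f"{col}_mean" for col in numeric.columns]

    # Categorical-like column: most frequent value per client
    mode = grouped["counter_type"].agg(lambda s: s.mode().iloc[0])
    mode = mode.rename("counter_type_mode")

    # Number of invoices per client
    n_invoices = grouped.size().rename("n_invoices")

    return pd.concat([numeric, mode, n_invoices], axis=1).reset_index()

# Toy rows, not real data
invoices = pd.DataFrame({
    "client_id": ["A", "A", "A", "B"],
    "consommation_level_1": [82, 1200, 123, 50],
    "months_number": [4, 4, 4, 8],
    "counter_type": ["ELEC", "ELEC", "GAZ", "GAZ"],
})

result = aggregate_invoices(invoices)
print(result)
```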

In [None]:
# merge dataframes

df_combined = pd.merge(df_client_train, df_invoice, on='client_id', how='inner')
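
An inner merge silently drops clients that have no invoice rows, and it would duplicate rows if either side had repeated `client_id` values. Below is a sketch of a guarded merge using `validate` and `indicator`, both standard `pd.merge` parameters. The toy frames stand in for `df_client_train` and the aggregated `df_invoice`.

```python
import pandas as pd

# Toy stand-ins for df_client_train and the aggregated df_invoice
clients = pd.DataFrame({"client_id": ["A", "B", "C"], "target": [0.0, 1.0, 0.0]})
invoice_agg = pd.DataFrame({"client_id": ["A", "B"], "n_invoices": [3, 1]})

# validate raises MergeError if client_id is not unique on both sides
merged = pd.merge(
    clients,
    invoice_agg,
    on="client_id",
    how="left",
    validate="one_to_one",
    indicator=True,
)

# Clients without any aggregated invoice row
unmatched = merged.loc[merged["_merge"] == "left_only", "client_id"].tolist()
print(f"clients without invoices: {unmatched}")

# Same rows an inner merge would keep
combined = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(combined)
```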


In [None]:
# Call the cleaning function (outlier handling, datetime conversion)

df_combined = convert_dtype_cat_date(df_combined)

In [None]:
# Define target(y) and features (X)

X = df_combined.drop(['target'], axis=1)
y = df_combined['target']

In [None]:
# train test split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, stratify=y, random_state=42)
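
Passing `stratify=y` keeps the fraud share the same in the training and test splits, which matters with a minority class this small. Below is a quick sketch for checking that, using toy data with 5% positives in place of the real `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 50 positives among 1000 rows
rng = np.random.default_rng(42)
X = pd.DataFrame({"feature": rng.normal(size=1000)})
y = pd.Series([1] * 50 + [0] * 950, name="target")

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# The three shares should match when the split is stratified
print(f"fraud share overall: {y.mean():.3f}")
print(f"fraud share train:   {y_train.mean():.3f}")
print(f"fraud share test:    {y_test.mean():.3f}")
```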

In [None]:
# SMOTE



In [None]:
# Scaling
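
This cell is also empty. Below is a sketch of one way to scale the features, fitting the scaler on the training split and reusing its statistics on the test split. The choice of `StandardScaler` is an assumption; the notebook also imports `MinMaxScaler` and `RobustScaler`, and given the extreme values seen in the invoice data, `RobustScaler` may be worth comparing. The random arrays stand in for the real splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numeric splits standing in for X_train and X_test
rng = np.random.default_rng(0)
X_train = rng.normal(loc=100.0, scale=20.0, size=(200, 3))
X_test = rng.normal(loc=100.0, scale=20.0, size=(50, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std on train only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics

print(X_train_scaled.mean(axis=0).round(3))
print(X_train_scaled.std(axis=0).round(3))
```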