# Fraud Detection in Electricity and Gas Consumption Challenge

File name: FraudDetectionAI.ipynb

Author: kogni7

Date: October/November 2021

## Contents
* 1 Preparation
* 2 Data
* 3 Training
* 4 Prediction and Submission

This notebook uses only the data sets provided by ZINDI. These data sets contain information about clients and their invoices, and no other features are used. The task is to classify each client as fraudulent or not.

The file system for this project is:
* FraudDetectionAI (root)
    * FraudDetectionAI.ipynb (this notebook)
    * Data
        * client_test.csv
        * client_train.csv
        * invoice_test.csv
        * invoice_train.csv
        * SampleSubmission.csv
    * Submission
        * 1 - x: Submission directories named by the version number
            * submission.csv

This Jupyter notebook runs in Google Colab without special configuration. The GPU is disabled.

This notebook uses the Julia version of XGBoost.

## 1 Preparation
Mount Google Drive so that the data can be loaded.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Install the Julia kernel.

In [None]:
url = 'https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz'
!wget -O- $url | tar xz -C /usr/local --strip-components 1
!julia -e 'using Pkg; pkg"add IJulia; precompile;"'

--2021-11-09 16:21:15--  https://julialang-s3.julialang.org/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz
Resolving julialang-s3.julialang.org (julialang-s3.julialang.org)... 151.101.2.49, 151.101.66.49, 151.101.130.49, ...
Connecting to julialang-s3.julialang.org (julialang-s3.julialang.org)|151.101.2.49|:443... connected.
HTTP request sent, awaiting response... 302 gce internal redirect trigger
Location: https://storage.googleapis.com/julialang2/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz [following]
--2021-11-09 16:21:15--  https://storage.googleapis.com/julialang2/bin/linux/x64/1.6/julia-1.6.2-linux-x86_64.tar.gz
Resolving storage.googleapis.com (storage.googleapis.com)... 74.125.203.128, 74.125.204.128, 64.233.189.128, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|74.125.203.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 112946671 (108M) [application/x-tar]
Saving to: ‘STDOUT’


Reload the browser tab so that the Julia kernel is picked up.

### Installation

In [1]:
import Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("StatsBase")
Pkg.add("MLLabelUtils")
Pkg.add("MLJ")
Pkg.add("XGBoost")
Pkg.add("MLJXGBoostInterface")
Pkg.add("Tables")
Pkg.add("FreqTables")

[32m[1m    Updating[22m[39m registry at `~/.julia/registries/General`
[32m[1m   Resolving[22m[39m package versions...
[32m[1m   Installed[22m[39m DataValueInterfaces ───────── v1.0.0
[32m[1m   Installed[22m[39m Crayons ───────────────────── v4.0.4
[32m[1m   Installed[22m[39m Formatting ────────────────── v0.4.2
[32m[1m   Installed[22m[39m Compat ────────────────────── v3.40.0
[32m[1m   Installed[22m[39m SortingAlgorithms ─────────── v1.0.1
[32m[1m   Installed[22m[39m OrderedCollections ────────── v1.4.1
[32m[1m   Installed[22m[39m DataFrames ────────────────── v1.2.2
[32m[1m   Installed[22m[39m IteratorInterfaceExtensions ─ v1.0.0
[32m[1m   Installed[22m[39m DataAPI ───────────────────── v1.9.0
[32m[1m   Installed[22m[39m InvertedIndices ───────────── v1.1.0
[32m[1m   Installed[22m[39m Tables ────────────────────── v1.6.0
[32m[1m   Installed[22m[39m DataStructures ────────────── v0.18.10
[32m[1m   Installed[22m[39m PooledArray

### Seed and Libraries

In [2]:
import Random
SEED = 42
Random.seed!(SEED)

using DataFrames, CSV
using MLLabelUtils
using Statistics
using StatsBase
using MLJ
using XGBoost
using Tables
using FreqTables

### Parameters

In [3]:
Version = "Version_3"
nfolds = 5
N = 100

100

## 2 Data

In [4]:
client_train_csv = DataFrame(CSV.File("/content/drive/MyDrive/FraudDetectionAI/Data/client_train.csv"))
invoice_train_csv = DataFrame(CSV.File("/content/drive/MyDrive/FraudDetectionAI/Data/invoice_train.csv"))

client_test_csv = DataFrame(CSV.File("/content/drive/MyDrive/FraudDetectionAI/Data/client_test.csv"))
invoice_test_csv = DataFrame(CSV.File("/content/drive/MyDrive/FraudDetectionAI/Data/invoice_test.csv"))

sample_submission_csv = DataFrame(CSV.File("/content/drive/MyDrive/FraudDetectionAI/Data/SampleSubmission.csv"))

# Remove unnecessary columns
client_train_csv = select!(client_train_csv, Not([:creation_date]))
invoice_train_csv = select!(invoice_train_csv, Not([:invoice_date]))

client_test_csv = select!(client_test_csv, Not([:creation_date]))
invoice_test_csv = select!(invoice_test_csv, Not([:invoice_date]))

# Label encode string columns (train and test are encoded separately)
invoice_train_csv.counter_type = convertlabel(LabelEnc.Indices, invoice_train_csv.counter_type)
invoice_train_csv.counter_statue = convertlabel(LabelEnc.Indices, invoice_train_csv.counter_statue)

invoice_test_csv.counter_type = convertlabel(LabelEnc.Indices, invoice_test_csv.counter_type)
invoice_test_csv.counter_statue = convertlabel(LabelEnc.Indices, invoice_test_csv.counter_statue)

# Aggregate the invoices per client: take the mean of every invoice column
invoice_cols = [:tarif_type, :counter_number, :counter_statue, :counter_code,
                :reading_remarque, :counter_coefficient,
                :consommation_level_1, :consommation_level_2,
                :consommation_level_3, :consommation_level_4,
                :old_index, :new_index, :months_number, :counter_type]

# `col => mean => col` stores each mean under the original column name
mean_by_client(df) = sort(combine(groupby(df, :client_id),
                                  invoice_cols .=> mean .=> invoice_cols), :client_id)

invoice_train_csv = mean_by_client(invoice_train_csv)
invoice_test_csv = mean_by_client(invoice_test_csv)

# Merge the data frames
train_csv = sort(innerjoin(client_train_csv, invoice_train_csv, on = :client_id), :client_id)
test_csv = sort(innerjoin(client_test_csv, invoice_test_csv, on = :client_id), :client_id)

| Row | disrict | client_id | client_catg | region | tarif_type | counter_number | counter_statue |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Int64 | String31 | Int64 | Int64 | Float64 | Float64 | Float64 |
| 1 | 62 | test_Client_0 | 11 | 307 | 11.0 | 651208.0 | 1.0 |
| 2 | 69 | test_Client_1 | 11 | 103 | 11.0 | 174760.0 | 1.04545 |
| 3 | 62 | test_Client_10 | 11 | 310 | 23.5405 | 3.46809e6 | 1.0 |
| 4 | 60 | test_Client_100 | 11 | 101 | 25.5 | 5.8665e5 | 1.0 |
| 5 | 62 | test_Client_1000 | 11 | 301 | 20.8491 | 1.61411e6 | 1.03774 |
| 6 | 62 | test_Client_10000 | 11 | 304 | 11.0 | 639605.0 | 1.0 |
| 7 | 62 | test_Client_10001 | 11 | 303 | 25.3117 | 3.84798e6 | 1.0 |
| 8 | 69 | test_Client_10002 | 11 | 104 | 29.039 | 3.70854e5 | 1.02597 |
| 9 | 62 | test_Client_10003 | 11 | 309 | 11.0 | 454471.0 | 1.10526 |
| 10 | 69 | test_Client_10004 | 11 | 107 | 21.7581 | 1.61557e6 | 1.01613 |
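
The cell above encodes the string columns of the train and test invoices in separate calls and then averages every invoice column per client. The following is a minimal sketch of those two steps in plain Julia, not the notebook's actual pipeline. The toy rows and the helper names `build_encoder`, `encode`, and `client_means` are hypothetical, as is the choice of mapping unseen labels to 0. It illustrates why a single mapping built from the training labels keeps the integer codes comparable across both sets.

```julia
using Statistics  # stdlib, provides mean

# Hypothetical toy invoices: (client_id, counter_type, consumption)
train_rows = [("c1", "ELEC", 100.0), ("c1", "GAZ", 50.0), ("c2", "ELEC", 30.0)]
test_rows  = [("t1", "GAZ", 80.0), ("t1", "GAZ", 20.0)]

# Build a label => index mapping in order of first appearance.
function build_encoder(labels)
    encoder = Dict{String,Int}()
    for l in labels
        get!(encoder, l, length(encoder) + 1)
    end
    return encoder
end

# One shared mapping, derived from the training labels only.
encoder = build_encoder(row[2] for row in train_rows)

# Encoding the test set on its own would give "GAZ" a different code.
separate_encoder = build_encoder(row[2] for row in test_rows)

# Unseen labels map to 0 here (an arbitrary choice for this sketch).
encode(encoder, label) = get(encoder, label, 0)

# Mean of the encoded counter type and of the consumption per client.
function client_means(rows, encoder)
    groups = Dict{String,Vector{Tuple{Int,Float64}}}()
    for (id, ctype, cons) in rows
        push!(get!(groups, id, Tuple{Int,Float64}[]), (encode(encoder, ctype), cons))
    end
    return Dict(id => (mean(first.(v)), mean(last.(v))) for (id, v) in groups)
end

train_means = client_means(train_rows, encoder)
test_means = client_means(test_rows, encoder)
```

With the shared mapping, `"GAZ"` has the same code in both sets. With `separate_encoder`, the same label would receive code 1 in the test set, and the averaged feature would no longer mean the same thing as in training.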


In [5]:
# Remove client_id
train_csv = select!(train_csv, Not([:client_id]))
test_csv = select!(test_csv, Not([:client_id]))

describe(train_csv)

| Row | variable | mean | min | median | max | nmissing | eltype |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Symbol | Float64 | Real | Float64 | Real | Int64 | DataType |
| 1 | disrict | 63.5112 | 60.0 | 62.0 | 69.0 | 0 | Int64 |
| 2 | client_catg | 11.5125 | 11.0 | 11.0 | 51.0 | 0 | Int64 |
| 3 | region | 206.16 | 101.0 | 107.0 | 399.0 | 0 | Int64 |
| 4 | target | 0.0558405 | 0.0 | 0.0 | 1.0 | 0 | Float64 |
| 5 | tarif_type | 17.1651 | 9.77778 | 11.0 | 45.0 | 0 | Float64 |
| 6 | counter_number | 690920000000.0 | 0.0 | 669184.0 | 27391100000000.0 | 0 | Float64 |
| 7 | counter_statue | 1.04486 | 1.0 | 1.0 | 10.0 | 0 | Float64 |
| 8 | counter_code | 199.547 | 0.0 | 203.0 | 600.0 | 0 | Float64 |
| 9 | reading_remarque | 7.3988 | 6.0 | 7.28571 | 413.0 | 0 | Float64 |
| 10 | counter_coefficient | 1.00169 | 0.888889 | 1.0 | 50.0 | 0 | Float64 |


In [6]:
describe(test_csv)

| Row | variable | mean | min | median | max | nmissing | eltype |
| --- | --- | --- | --- | --- | --- | --- | --- |
|  | Symbol | Float64 | Real | Float64 | Real | Int64 | DataType |
| 1 | disrict | 63.5106 | 60.0 | 62.0 | 69.0 | 0 | Int64 |
| 2 | client_catg | 11.5072 | 11.0 | 11.0 | 51.0 | 0 | Int64 |
| 3 | region | 206.018 | 101.0 | 107.0 | 399.0 | 0 | Int64 |
| 4 | tarif_type | 17.171 | 9.86207 | 11.0 | 45.0 | 0 | Float64 |
| 5 | counter_number | 713907000000.0 | 0.0 | 664794.0 | 27391100000000.0 | 0 | Float64 |
| 6 | counter_statue | 1.04316 | 1.0 | 1.0 | 5.0 | 0 | Float64 |
| 7 | counter_code | 199.964 | 5.0 | 203.0 | 600.0 | 0 | Float64 |
| 8 | reading_remarque | 7.39764 | 6.0 | 7.29167 | 9.0 | 0 | Float64 |
| 9 | counter_coefficient | 1.00051 | 0.929825 | 1.0 | 10.0 | 0 | Float64 |
| 10 | consommation_level_1 | 434.251 | 0.0 | 358.474 | 57651.0 | 0 | Float64 |


In [7]:
sort(freqtable(train_csv.target))

2-element Named Vector{Int64}
Dim1  │ 
──────┼───────
1.0   │   7566
0.0   │ 127927
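
The frequency table shows a strongly imbalanced target. The short sketch below works out the fraud share and the negative-to-positive ratio from these counts. The XGBoost library exposes a `scale_pos_weight` parameter that is often set to such a ratio, but this notebook does not set it. The sketch then illustrates, under simplifying assumptions, how stratified folds keep the fraud share nearly constant across folds. The helper `stratified_folds` is hypothetical and is not MLJ's `StratifiedCV` implementation.

```julia
using Random

# Class counts copied from the frequency table above.
n_fraud, n_legit = 7566, 127927

fraud_rate = n_fraud / (n_fraud + n_legit)   # share of positive labels
neg_pos_ratio = n_legit / n_fraud            # negatives per positive

# Minimal stratified fold assignment: shuffle the indices of each class
# separately and deal them round-robin into `nfolds` folds.
function stratified_folds(labels, nfolds; rng=Random.default_rng())
    folds = zeros(Int, length(labels))
    for class in unique(labels)
        idx = shuffle(rng, findall(==(class), labels))
        for (k, i) in enumerate(idx)
            folds[i] = mod1(k, nfolds)
        end
    end
    return folds
end

# Synthetic label vector with the same class counts as the training data.
labels = vcat(ones(Int, n_fraud), zeros(Int, n_legit))
folds = stratified_folds(labels, 5; rng=MersenneTwister(42))

# Fraud share inside each fold; every value stays close to `fraud_rate`.
fold_rates = [count(==(1), labels[folds .== f]) / count(==(f), folds) for f in 1:5]
```

Because fewer than 6 % of the clients are labeled as fraud, a classifier that always predicts "no fraud" would already reach more than 94 % accuracy. That is a plausible reason to prefer a ranking measure such as AUC here.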

In [8]:
# Build the label vector and the feature tables for MLJ
labels_xg = train_csv.target
features_xg = select(train_csv, Not([:target]))
labels_xg = coerce(labels_xg, Multiclass)
features_xg = Matrix(features_xg)
features_xg = MLJ.table(features_xg)

features_test_xg = test_csv
features_test_xg = Matrix(features_test_xg)
features_test_xg = MLJ.table(features_test_xg)

Tables.MatrixTable{Matrix{Float64}} with 58069 rows, 17 columns, and schema:
 :x1   Float64
 :x2   Float64
 :x3   Float64
 :x4   Float64
 :x5   Float64
 :x6   Float64
 :x7   Float64
 :x8   Float64
 :x9   Float64
 :x10  Float64
 :x11  Float64
 :x12  Float64
 :x13  Float64
 :x14  Float64
 :x15  Float64
 :x16  Float64
 :x17  Float64

## 3 Training
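
The tuning cell in this section draws `N = 100` hyperparameter candidates with `RandomSearch` and scores each one by cross-validated AUC. The idea can be sketched in plain Julia as follows. The sketch assumes uniform sampling within each range and uses a made-up `toy_score` function in place of the cross-validated score. The helpers `sample_range`, `draw`, and `random_search` are hypothetical names, not part of MLJ.

```julia
using Random

# Search space copied from the ranges in the tuning cell: (lower, upper).
space = (eta=(0.01, 0.99), gamma=(0.01, 2.5), max_depth=(1, 10),
         min_child_weight=(0.01, 0.5), subsample=(0.01, 0.99))

# Uniform draw for continuous ranges, uniform over the integers otherwise.
sample_range(rng, lo::Float64, hi::Float64) = lo + (hi - lo) * rand(rng)
sample_range(rng, lo::Int, hi::Int) = rand(rng, lo:hi)

# One candidate: a NamedTuple with one sampled value per hyperparameter.
draw(rng, space) = map(r -> sample_range(rng, r...), space)

# Hypothetical stand-in for the cross-validated AUC of a candidate.
toy_score(p) = -abs(p.eta - 0.1) - 0.01 * abs(p.max_depth - 6)

# Evaluate `n` random candidates and keep the best-scoring one.
function random_search(rng, space, n)
    best, best_score = nothing, -Inf
    for _ in 1:n
        candidate = draw(rng, space)
        s = toy_score(candidate)
        if s > best_score
            best, best_score = candidate, s
        end
    end
    return best, best_score
end

best, best_score = random_search(MersenneTwister(42), space, 100)
```

In the notebook, each candidate is far more expensive to score than in this sketch, since every evaluation trains one model per fold.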

In [9]:
xg = @load XGBoostClassifier
xg = xg()
xg.seed = SEED
xg.num_round = 200

# eta = learning rate in [0, 1]
r1 = range(xg, :(eta), lower=0.01, upper=0.99)

# gamma in [0, ∞)
r2 = range(xg, :(gamma), lower=0.01, upper=2.5)

# max depth in [1, ∞)
r3 = range(xg, :(max_depth), lower=1, upper=10)

# min child weight in [0, ∞)
r4 = range(xg, :(min_child_weight), lower=0.01, upper=0.5)

# subsample in (0, 1]
r5 = range(xg, :(subsample), lower=0.01, upper=0.99)

# Random search over the five ranges, scored by stratified cross-validated AUC
xg_tune = TunedModel(model=xg,
                     tuning=RandomSearch(rng=SEED),
                     range=[r1, r2, r3, r4, r5],
                     resampling=StratifiedCV(nfolds=nfolds, rng=SEED),
                     measure=auc,
                     n=N);
xg_mach = machine(xg_tune, features_xg, labels_xg)

MLJ.fit!(xg_mach)
report(xg_mach)

┌ Info: For silent loading, specify `verbosity=0`. 
└ @ Main /root/.julia/packages/MLJModels/4sRmw/src/loading.jl:168


import MLJXGBoostInterface ✔


┌ Info: Training Machine{ProbabilisticTunedModel{RandomSearch,…},…}.
└ @ MLJBase /root/.julia/packages/MLJBase/u6vLz/src/machines.jl:403
┌ Info: Attempting to evaluate 100 models.
└ @ MLJTuning /root/.julia/packages/MLJTuning/bjRHJ/src/tuned_models.jl:680


(best_model = XGBoostClassifier,
 best_history_entry = (model = XGBoostClassifier,
                       measure = [AreaUnderCurve()],
                       measurement = Float32[0.84079],
                       per_fold = Vector{Float32}[[0.83984184, 0.8288561, 0.8446317, 0.84244806, 0.84817255]],),
 history = NamedTuple{(:model, :measure, :measurement, :per_fold), Tuple{MLJXGBoostInterface.XGBoostClassifier, Vector{AreaUnderCurve}, Vector{Float32}, Vector{Vector{Float32}}}}[(model = XGBoostClassifier, measure = [AreaUnderCurve()], measurement = [0.74722403], per_fold = [[0.7459393, 0.74953866, 0.7458044, 0.7425663, 0.7522715]]), (model = XGBoostClassifier, measure = [AreaUnderCurve()], measurement = [0.79236704], per_fold = [[0.7897958, 0.7840758, 0.7953931, 0.79796326, 0.79460734]]), (model = XGBoostClassifier, measure = [AreaUnderCurve()], measurement = [0.83155394], per_fold = [[0.8279012, 0.82278943, 0.8366911, 0.83279186, 0.8375961]]), (model = XGBoostClassifier, measure = [Ar
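
The `measurement` reported for the best model appears to be the plain average of its five `per_fold` AUC values. A quick check with the numbers copied from the report above follows. The standard deviation is an added illustration of the spread across folds and is not part of the report.

```julia
using Statistics  # stdlib, provides mean and std

# Values copied from `best_history_entry` in the report above.
per_fold = Float32[0.83984184, 0.8288561, 0.8446317, 0.84244806, 0.84817255]
reported = 0.84079f0

fold_mean = mean(per_fold)   # average AUC over the five folds
fold_std = std(per_fold)     # spread of the AUC across folds

println("mean AUC over folds: ", fold_mean)
println("std of AUC over folds: ", fold_std)
println("matches reported measurement: ", isapprox(fold_mean, reported; atol=1e-5))
```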

## 4 Prediction and Submission

In [10]:
prediction = MLJ.predict(xg_mach, features_test_xg)
# Keep only the predicted probability of the fraud class (label 1.0)
prediction = broadcast(pdf, prediction, 1.0)

sample_submission_csv.target = prediction
sample_submission_csv

| Row | client_id | target |
| --- | --- | --- |
|  | String31 | Float32 |
| 1 | test_Client_0 | 0.0304951 |
| 2 | test_Client_1 | 0.313587 |
| 3 | test_Client_10 | 0.0520297 |
| 4 | test_Client_100 | 0.0198115 |
| 5 | test_Client_1000 | 0.0405611 |
| 6 | test_Client_10000 | 0.0668477 |
| 7 | test_Client_10001 | 0.00568761 |
| 8 | test_Client_10002 | 0.218913 |
| 9 | test_Client_10003 | 0.00527934 |
| 10 | test_Client_10004 | 0.0552941 |
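
The assignment in the cell above is positional. It relies on `sample_submission_csv` listing exactly the clients of `test_csv`, in the same sorted order. A hedged sketch of a key-based alternative follows, using made-up IDs and probabilities. The names `test_ids`, `probs`, `submission_ids`, and `lookup` are hypothetical, and in the notebook the IDs would have to be saved before `client_id` is dropped.

```julia
# Hypothetical toy data: predictions in the order of the sorted test features,
# and a submission template whose rows are ordered differently.
test_ids = ["test_Client_0", "test_Client_1", "test_Client_10"]
probs = Float32[0.03, 0.31, 0.05]
submission_ids = ["test_Client_1", "test_Client_0", "test_Client_10", "test_Client_2"]

# Map each client ID to its predicted fraud probability.
lookup = Dict(zip(test_ids, probs))

# Look up every template row by ID. Clients without a prediction
# (for example, dropped by an inner join) become `missing`.
aligned = [get(lookup, id, missing) for id in submission_ids]

# A positional assignment is only safe if both ID lists are identical.
same_order = test_ids == submission_ids
n_unmatched = count(ismissing, aligned)
```

If `same_order` is `true` and `n_unmatched` is zero, the positional assignment and the lookup give the same result.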


In [11]:
# mkpath creates missing parent directories and does not fail if the directory already exists
mkpath(string("/content/drive/MyDrive/FraudDetectionAI/Submission/", Version))

CSV.write(string("/content/drive/MyDrive/FraudDetectionAI/Submission/", Version, "/submission.csv"), sample_submission_csv)

"/content/drive/MyDrive/FraudDetectionAI/Submission/Version_3/submission.csv"