# Fraud Detection in Tunisian Energy and Gas Consumption

### Tetyana, Christian and Jakob

![alt text](images/under_construction.jpg)


## Case Description

<img src="images/steg.jpg" width="300"/>

- The Tunisian Company of Electricity and Gas (STEG) delivers electricity and gas
- STEG suffered losses of 65 million USD due to manipulated meters
- Goal: classify clients as fraudulent or non-fraudulent to limit financial and reputational damage

## Data

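```python
import warnings

import pandas as pd
from IPython.display import display

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

# Load the client and invoice tables
df_client = pd.read_csv("data/client_train.csv")
df_invoice = pd.read_csv("data/invoice_train.csv")

# Merge invoices with client information (one row per invoice)
df = df_invoice.merge(df_client, on="client_id")

display(df)
```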

| | client_id | invoice_date | tarif_type | counter_number | counter_statue | counter_code | reading_remarque | counter_coefficient | consommation_level_1 | consommation_level_2 | ... | consommation_level_4 | old_index | new_index | months_number | counter_type | disrict | client_catg | region | creation_date | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | train_Client_0 | 2014-03-24 | 11 | 1335667 | 0 | 203 | 8 | 1 | 82 | 0 | ... | 0 | 14302 | 14384 | 4 | ELEC | 60 | 11 | 101 | 31/12/1994 | 0.0 |
| 1 | train_Client_0 | 2013-03-29 | 11 | 1335667 | 0 | 203 | 6 | 1 | 1200 | 184 | ... | 0 | 12294 | 13678 | 4 | ELEC | 60 | 11 | 101 | 31/12/1994 | 0.0 |
| 2 | train_Client_0 | 2015-03-23 | 11 | 1335667 | 0 | 203 | 8 | 1 | 123 | 0 | ... | 0 | 14624 | 14747 | 4 | ELEC | 60 | 11 | 101 | 31/12/1994 | 0.0 |
| 3 | train_Client_0 | 2015-07-13 | 11 | 1335667 | 0 | 207 | 8 | 1 | 102 | 0 | ... | 0 | 14747 | 14849 | 4 | ELEC | 60 | 11 | 101 | 31/12/1994 | 0.0 |
| 4 | train_Client_0 | 2016-11-17 | 11 | 1335667 | 0 | 207 | 9 | 1 | 572 | 0 | ... | 0 | 15066 | 15638 | 12 | ELEC | 60 | 11 | 101 | 31/12/1994 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4476744 | train_Client_99998 | 2005-08-19 | 10 | 1253571 | 0 | 202 | 9 | 1 | 400 | 135 | ... | 0 | 3197 | 3732 | 8 | ELEC | 60 | 11 | 101 | 22/12/1993 | 0.0 |
| 4476745 | train_Client_99998 | 2005-12-19 | 10 | 1253571 | 0 | 202 | 6 | 1 | 200 | 6 | ... | 0 | 3732 | 3938 | 4 | ELEC | 60 | 11 | 101 | 22/12/1993 | 0.0 |
| 4476746 | train_Client_99999 | 1996-09-25 | 11 | 560948 | 0 | 203 | 6 | 1 | 259 | 0 | ... | 0 | 13884 | 14143 | 4 | ELEC | 60 | 11 | 101 | 18/02/1986 | 0.0 |
| 4476747 | train_Client_99999 | 1996-05-28 | 11 | 560948 | 0 | 203 | 6 | 1 | 603 | 0 | ... | 0 | 13281 | 13884 | 4 | ELEC | 60 | 11 | 101 | 18/02/1986 | 0.0 |
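Each client has many invoices, so the merge above is many-to-one from invoices to clients. The following is a minimal sketch on toy frames of how that assumption could be checked with the `validate` and `indicator` options of pandas' `merge`. The values are invented; only the column names `client_id`, `target` and `consommation_level_1` come from the dataset. It uses `how="left"` instead of the default inner join so that invoices without a client stay visible.

```python
import pandas as pd

# Toy stand-ins for client_train.csv and invoice_train.csv (values are invented).
clients = pd.DataFrame({"client_id": ["c0", "c1"], "target": [0.0, 1.0]})
invoices = pd.DataFrame({
    "client_id": ["c0", "c0", "c1"],
    "consommation_level_1": [82, 1200, 603],
})

# validate="many_to_one" raises MergeError if client_id is not unique in `clients`.
# indicator=True adds a "_merge" column recording where each row was found.
merged = invoices.merge(
    clients, on="client_id", how="left", validate="many_to_one", indicator=True
)

# Invoices without a matching client would be marked "left_only".
unmatched = int((merged["_merge"] == "left_only").sum())
print(len(merged), unmatched)
```

If the check passes on the real tables, the merged frame has exactly one row per invoice and every invoice carries its client's `target` label.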


## Data Challenges

1. Poor data documentation
2. Very large dataset (about 4.5 million invoice rows after merging)
3. Highly imbalanced target classes
4. Columns with mixed data-types and out-of-range values
5. No fraud detection before 2005
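Two of these challenges, class imbalance and mixed data types, can be quantified with a few lines of pandas. The sketch below uses an invented toy frame. Treating `counter_statue` as the mixed-type column is an assumption made for illustration, and `fraud_share` and `n_invalid` are hypothetical names.

```python
import pandas as pd

# Toy frame (invented values); column names follow the merged table above.
df_toy = pd.DataFrame({
    "client_id": ["c0", "c0", "c1", "c2", "c2"],
    "counter_statue": [0, "0", "A", 1, "5"],  # mixed int/str entries
    "target": [0.0, 0.0, 1.0, 0.0, 0.0],
})

# Measure class balance per client, not per invoice: keep one row per client.
per_client = df_toy.drop_duplicates("client_id")
fraud_share = per_client["target"].mean()

# Coerce a mixed-type column to numbers; unparseable entries become NaN.
status = pd.to_numeric(df_toy["counter_statue"], errors="coerce")
n_invalid = int(status.isna().sum())

print(f"fraud share: {fraud_share:.2f}, invalid status values: {n_invalid}")
```

Counting fraud per client rather than per invoice avoids weighting clients by how many invoices they happen to have.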


## Tested Classifiers
1. KNN: long prediction time.
2. Decision Trees: fast; very good performance after thorough data cleaning.
3. Random Forests: good performance, acceptable computation time.
4. Extra Trees: fast, moderate performance.
5. K-Means (unsupervised clustering): fast, poor classification performance.
6. LightGBM: fast, poor classification performance.
7. XGBoost: fast, poor classification performance.
8. Logistic Regression: fast, poor classification performance.
9. AdaBoost: slow, poor classification performance.
10. SVM: very slow.
11. Naive Bayes: fast, poor classification performance.
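A comparison like the one above weighs run time against classification quality. The sketch below shows one way such a comparison could be set up with scikit-learn. It uses synthetic imbalanced data rather than the STEG data and only two of the listed models, and the choice of F1 score as the metric is an assumption, since the metric used in this project is not stated here.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (about 5% positives); not the STEG data.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in models.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    elapsed = time.perf_counter() - start
    # F1 on the positive class says more than accuracy on imbalanced data.
    results[name] = {
        "f1": f1_score(y_test, y_pred, zero_division=0),
        "seconds": elapsed,
    }

for name, res in results.items():
    print(f"{name}: f1={res['f1']:.2f}, fit+predict={res['seconds']:.3f}s")
```

The stratified split keeps the share of positives equal in the training and test sets, which matters when positives are rare.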

## Fraud Detector Performance

![alt text](images/confusion_matrix_knn.png)
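A confusion matrix like the one shown above can be reduced to precision and recall for the fraud class. The sketch below uses invented labels, not the project's predictions, and builds the matrix with `pd.crosstab`.

```python
import pandas as pd

# Invented labels for illustration: 1 = fraud, 0 = no fraud.
y_true = pd.Series([0, 0, 0, 0, 1, 1, 0, 1, 0, 0], name="actual")
y_pred = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="predicted")

# Rows are actual classes, columns are predicted classes.
cm = pd.crosstab(y_true, y_pred)

tp = cm.loc[1, 1]  # fraud correctly flagged
fp = cm.loc[0, 1]  # honest clients flagged as fraud
fn = cm.loc[1, 0]  # fraud that was missed

precision = tp / (tp + fp)  # share of flagged clients that are fraudulent
recall = tp / (tp + fn)     # share of fraudulent clients that were flagged

print(cm)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For a utility, low recall means undetected losses, while low precision means unnecessary inspections of honest clients.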

## Recommendations for Stakeholder

- Provide better data documentation
- Train employees in data acquisition
- Check meter numbers for signs of manipulation