*****PAS 2024 Resources P3: leak detection (oil & gas)*****

In this file, we provide some libraries and functions that you will need for the project. However, don't limit yourself to what is given here; it's up to you to explore and find additional resources to successfully complete the project.

# **1. Libraries**

In [None]:
import pandas as pd  # importing and exporting data
import numpy as np  # numerical computing
import matplotlib.pyplot as plt  # Plot for data visualisation
import seaborn as sns   # statistical data visualisation
from sklearn.model_selection import train_test_split # data separation: train and test data
from sklearn.preprocessing import StandardScaler, LabelEncoder  # data normalisation
from imblearn.over_sampling import SMOTE # Oversampling
from sklearn.compose import make_column_selector # Separate numerical and categorical variables

# Classical machine learning models for classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
import xgboost as xgb # gradient boosting model
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix  # model evaluation

# Deep learning models for classification with tensorflow. You can also use Pytorch
from tensorflow.keras.models import Sequential # neural network model
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization # neural network layers
from tensorflow.keras.callbacks import EarlyStopping # callback that monitors a specified metric during training and stops the training process when that metric stops improving.



# **2. Data import**

In [None]:
path = 'File path / file_name.csv'  # replace with the actual location of your CSV file
data = pd.read_csv(path)
data

# **3. Data Exploration**

Data information: lists the variables, the number of non-null values and the data type of each

In [None]:
data.info()

Data size

In [None]:
data.shape

Create a pie chart showing the distribution of different data types, if necessary.

In [None]:
data.dtypes.value_counts().plot.pie()

Visualize missing values

In [None]:
sns.heatmap(data.isna(), cbar=False) # shows where missing values (NaNs) occur in the DataFrame

Analyze the distribution of values in the target

In [None]:
data["target"].value_counts(normalize=True)  # proportion of each unique value in the target column

Proportion of missing values (NaNs) in each variable

In [None]:
data["variable"].isna().sum()/data.shape[0]
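
The expression above covers a single column. A minimal sketch for all columns at once is shown here on a small made-up DataFrame; the column names and values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy DataFrame standing in for the project data (hypothetical columns and values)
toy = pd.DataFrame({
    "pressure": [1.0, np.nan, 3.0, 4.0],
    "flow": [np.nan, np.nan, 0.5, 0.7],
    "target": [0, 1, 0, 1],
})

# isna() gives a boolean mask; its column-wise mean is the fraction of NaNs per column
missing_ratio = toy.isna().mean().sort_values(ascending=False)
print(missing_ratio)
```

Sorting in descending order puts the most incomplete variables first, which helps decide which ones to impute or drop.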

If necessary, visualize each categorical (object) column of the DataFrame data with a pie chart showing the distribution of its unique values.

In [None]:
for col in data.select_dtypes('object'):
    plt.figure()
    data[col].value_counts().plot.pie() # each chart displays the proportions of the different categories

**Relationship between target and variables**

Create subsets for leak and no leak

In [None]:
data = data.drop(['id'], axis=1) # you need to drop the id variable

leak_yes = data[data['target']==1]
leak_no = data[data['target']==0]
leak_yes.describe()
leak_no.describe()

Relationship between target and numerical variables



---



In [None]:
# Generate distribution plots for each float column of data, comparing the leak and no-leak groups.
# sns.distplot is deprecated in recent seaborn versions; sns.histplot with kde=True is its replacement.
for col in data.select_dtypes('float'):
    plt.figure()
    sns.histplot(leak_yes[col], kde=True, stat='density', label='leak')
    sns.histplot(leak_no[col], kde=True, stat='density', label='no leak')
    plt.legend()

**Relationship between variables**

Correlations

In [None]:
# Correlation between variables (numeric_only=True skips the categorical columns)
sns.clustermap(data.corr(numeric_only=True))

In [None]:
# Visualization tool to explore relationships between multiple variables, allowing you to quickly identify patterns, correlations, and potential outliers.
sns.pairplot(data)

**We can also look at the relationship between each variable and the target**
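
One possible way to do this, sketched on synthetic data, is to compute the correlation of each numerical variable with the target and to compare per-class means. The columns `pressure_drop` and `temperature` are hypothetical, and the data is generated so that only the first depends on the target.

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption: binary target, two made-up numerical features)
rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "pressure_drop": target * 2.0 + rng.normal(size=n),  # built to depend on the target
    "temperature": rng.normal(size=n),                    # built to be unrelated to it
    "target": target,
})

# Pearson correlation of each numerical feature with the target
corr_with_target = toy.corr(numeric_only=True)["target"].drop("target")
print(corr_with_target.sort_values(key=abs, ascending=False))

# Mean of each feature within each class
class_means = toy.groupby("target").mean()
print(class_means)
```

Pearson correlation only captures linear relationships, so a low value does not by itself mean a variable is uninformative.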

If necessary, replace missing values with the median. You can also use the mean or another statistic.

In [None]:
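median = data["variable with missing values"].median()
# Assign the result back instead of using inplace=True on a column, which recent pandas versions warn about
data["variable with missing values"] = data["variable with missing values"].fillna(median)
data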

If necessary, separate numerical and categorical variables

In [None]:
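# Separate the target from the features first, so that it is neither scaled nor encoded
target = data["target"]
data = data.drop(columns=["target"])

numerical_columns_selector = make_column_selector(dtype_exclude=object)
categorical_columns_selector = make_column_selector(dtype_include=object)

numerical_columns = numerical_columns_selector(data)
categorical_columns = categorical_columns_selector(data)
print(numerical_columns)
print(categorical_columns)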

Data preprocessing

In [None]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")
numerical_preprocessor = StandardScaler()

preprocessor = ColumnTransformer([
    ('onehotencoder',categorical_preprocessor,categorical_columns),
    ('standardscaler',numerical_preprocessor,numerical_columns)])

data = preprocessor.fit_transform(data)
data
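
A common variation on the preprocessing step is to split the data first and fit the preprocessor on the training rows only, so that the validation rows do not influence the scaling statistics. The sketch below illustrates this on a tiny made-up DataFrame; the column names `pressure` and `pipe_type` and all values are hypothetical and are not taken from the project data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data standing in for the project data (hypothetical columns and values)
toy = pd.DataFrame({
    "pressure": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "pipe_type": ["steel", "pvc", "steel", "pvc", "steel", "pvc"],
})
toy_target = pd.Series([0, 1, 0, 1, 0, 1])

# Split first: 4 rows for training, 2 rows for validation
X_train, X_val, y_train, y_val = train_test_split(
    toy, toy_target, test_size=2, random_state=42
)

pre = ColumnTransformer([
    ("onehotencoder", OneHotEncoder(handle_unknown="ignore"), ["pipe_type"]),
    ("standardscaler", StandardScaler(), ["pressure"]),
])

# Fit on the training rows only, then reuse the fitted transformer on validation rows
X_train_t = pre.fit_transform(X_train)
X_val_t = pre.transform(X_val)
print(X_train_t.shape, X_val_t.shape)
```

Whether this matters in practice depends on the dataset size; with many rows the difference is often small.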

Data separation for train and validation

In [None]:
data_train, data_val, target_train, target_val = train_test_split(data, target, test_size =0.30, random_state = 42)
target_train.value_counts() # number of samples per class in the training set

If necessary, oversample the training set. This is useful for imbalanced classes.

In [None]:
# Add synthetic data to have balanced classes for training.

smote = SMOTE()  # use a lowercase name so the SMOTE class itself is not overwritten
data_train_smote, target_train_smote = smote.fit_resample(data_train, target_train)
target_train_smote.value_counts()

**Training model**

In [None]:
model = RandomForestClassifier()  # or any other classifier imported above: KNeighborsClassifier(), SVC(), ...
model.fit(data_train, target_train)
# model.fit(data_train_smote, target_train_smote)  # use this line instead if oversampling

**Model predict**

In [None]:
target_pred  = model.predict(data_val)

**Evaluation**

In [None]:
from sklearn.metrics import classification_report # provides key metrics such as precision, recall, F1 score, and support for each class in your classification problem.
from sklearn import metrics
fpr, tpr, thresholds = metrics.roc_curve(target_val, target_pred)

print('AUC : ', metrics.auc(fpr, tpr))
print(classification_report(target_val, target_pred))
plt.plot(fpr, tpr, label ='model_name')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
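
The ROC computation above uses hard class predictions, which yields a curve with a single operating point. When the model exposes `predict_proba`, passing the predicted probability of the positive class usually gives a more informative curve. The following is a minimal sketch on synthetic data generated with `make_classification`; the dataset and the choice of `LogisticRegression` are assumptions made for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced dataset (assumption: about 20% positive class)
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Probability of the positive class (column 1) instead of hard 0/1 predictions
scores = clf.predict_proba(X_val)[:, 1]
fpr, tpr, thresholds = roc_curve(y_val, scores)
auc_value = roc_auc_score(y_val, scores)
print("AUC:", auc_value)
```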

A confusion matrix is also useful.
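
As a minimal sketch, the example below computes a confusion matrix with `confusion_matrix` from `sklearn.metrics` on hand-written labels. The two label lists are invented for illustration; in the project they would be replaced by `target_val` and `target_pred`.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = leak, 0 = no leak
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("missed leaks (false negatives):", fn)
```

For leak detection, the false negatives (leaks predicted as no leak) are likely the errors to watch most closely.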

**We also recommend using GridSearchCV (from sklearn.model_selection) to optimize the hyperparameters of the chosen model.**
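
A minimal sketch of a grid search, assuming a random forest and a deliberately small, arbitrary parameter grid on synthetic data. The grid values, the `f1` scoring and the 3-fold cross-validation are illustrative choices, not recommendations for the project data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the preprocessed training set
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Hypothetical grid: 2 x 2 = 4 parameter combinations
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",  # F1 is often preferred over accuracy for imbalanced classes
    cv=3,          # 3-fold cross-validation on the data passed to fit
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_estimator_` is the model refitted on all the data passed to `fit` with the best parameters, and it can be used directly for prediction.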