# <h1 style="font-family: Trebuchet MS; padding: 5px; font-size: 48px; color: #0066b2; text-align: center; line-height: 1;"><b>Fraud Detection<span style="color: #000000"> Data modeling </span></b></h1>


<a id="0"></a>

----

  
## Table of Contents
1. [Context](#context)
2. [Import Necessary Libraries](#import-libraries)
3. [Import Data](#import-data)
4. [Data preprocessing](#data-exploration)
   

---


<a id="context"></a>


# **Context**


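This notebook trains and compares several classifiers (random forest, k-nearest neighbors, logistic regression, and XGBoost) to detect fraudulent credit card transactions, then logs a model run to MLflow.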

 <p><a href="#0" style="background-color: #e7e7e7;
  color: #008CBA;
  border: none;
  padding: 5px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 16px;
  margin: 4px 2px;
  cursor: pointer;
  border-radius: 100%;">Back</a></p>
  
---
  
## **Content**
``Unnamed: 0`` - Index or identifier for the data entries.

``trans_date_trans_time`` - Date and time of the transaction.

``cc_num`` - Credit card number used for the transaction.

``merchant`` - Merchant or vendor involved in the transaction.

``category`` - Category or type of transaction.

``amt`` - Transaction amount.

``first`` - First name of the credit card holder.

``last`` - Last name of the credit card holder.

``gender`` - Gender of the credit card holder.

``street`` - Street address of the credit card holder.

``city`` - City of the credit card holder.

``state`` - State of the credit card holder.

``zip`` - ZIP code of the credit card holder.

``lat`` - Latitude coordinate associated with the transaction.

``long`` - Longitude coordinate associated with the transaction.

``city_pop`` - Population of the city where the transaction occurred.

``job`` - Occupation or job of the credit card holder.

``dob`` - Date of birth of the credit card holder.

``trans_num`` - Transaction number or identifier.

``unix_time`` - Transaction time in UNIX timestamp format.

``merch_lat`` - Latitude coordinate of the merchant's location.

``merch_long`` - Longitude coordinate of the merchant's location.

``is_fraud`` - Indicator for whether the transaction is fraudulent (binary: 1 for fraud, 0 for non-fraud).

<a id="import-libraries"></a>

# **Import Necessary Libraries**

 <p><a href="#0" style="background-color: #e7e7e7;
  color: #008CBA;
  border: none;
  padding: 5px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 16px;
  margin: 4px 2px;
  cursor: pointer;
  border-radius: 100%;">Back</a></p>

In [165]:
# Standard library
import datetime
import sys
import warnings

# Third-party
import dotenv
import imblearn
import mlflow
import pandas as pd
import sklearn
import xgboost
from dotenv import load_dotenv
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from xgboost import XGBClassifier

pd.options.display.max_columns = None  # show every column when displaying DataFrames
warnings.filterwarnings("ignore")  # silences all warnings for the rest of the session

print("Python: {}".format(sys.version))
print("Pandas: {}".format(pd.__version__))
print("mlflow: {}".format(mlflow.__version__))
print("XGBoost: {}".format(xgboost.__version__))
print("sklearn: {}".format(sklearn.__version__))
print("imblearn: {}".format(imblearn.__version__))


Python: 3.10.9 | packaged by conda-forge | (main, Jan 11 2023, 15:15:40) [MSC v.1916 64 bit (AMD64)]
Pandas: 2.1.1
mlflow: 2.7.1
XGBoost: 2.0.0
sklearn: 1.3.1
imblearn: 0.11.0


In [1]:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report
from imblearn.over_sampling import SMOTE
import pandas as pd
import mlflow
data = pd.read_csv("../data/train_data.csv")
data.head()


Unnamed: 0,amt,cc_num,hour,category,month,city_pop,state,is_fraud
0,0.390223,-0.318735,-1.144773,4,1.713973,-0.288315,43,0
1,-0.420863,-0.315007,-0.998099,8,-1.504564,-0.274707,43,0
2,5.77469,-0.318676,1.348692,11,0.89203,-0.265244,25,1
3,-0.154888,-0.318735,1.348692,7,0.251002,0.619883,9,0
4,0.479788,-0.185695,-1.737527,1,-0.81065,1.449675,24,1


In [2]:
X = data.drop('is_fraud', axis=1) 
y = data['is_fraud'] 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
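
Fraudulent transactions are a small minority in this data (2,145 of 555,719 rows in the test reports further down), so a purely random split can leave the two partitions with slightly different fraud rates. A minimal sketch of a stratified split follows. The arrays `X_demo` and `y_demo` are toy stand-ins, not the notebook's data; `stratify` is a standard argument of `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 100 rows, 5% labeled as fraud
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 95 + [1] * 5)

# stratify=y_demo keeps the class proportions equal in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

fraud_rate_train = y_tr.mean()
fraud_rate_test = y_te.mean()
print(fraud_rate_train, fraud_rate_test)
```

With stratification, both partitions keep the original share of positive labels, which makes F1 scores computed on the held-out part easier to compare across runs.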

In [172]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from datetime import datetime
from dateutil.relativedelta import relativedelta


In [173]:
data = pd.read_csv("../data/fraudTrain.csv")
datatest = pd.read_csv("../data/fraudTest.csv")

In [3]:
data['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time'], format='%Y-%m-%d %H:%M:%S')
data['year'] = data['trans_date_trans_time'].dt.year
data['month'] = data['trans_date_trans_time'].dt.month
data['day'] = data['trans_date_trans_time'].dt.day
data['hour'] = data['trans_date_trans_time'].dt.hour
data['minute'] = data['trans_date_trans_time'].dt.minute
data['second'] = data['trans_date_trans_time'].dt.second
data['dob'] = pd.to_datetime(data['dob'])
current_date = datetime.now()
data['age'] = data['dob'].apply(lambda x: relativedelta(current_date, x).years)

In [174]:
def process_transaction_data(data):
    data['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time'], format='%Y-%m-%d %H:%M:%S')
    # Extract year, month, day, hour, minute, and second
    data['year'] = data['trans_date_trans_time'].dt.year
    data['month'] = data['trans_date_trans_time'].dt.month
    data['day'] = data['trans_date_trans_time'].dt.day
    data['hour'] = data['trans_date_trans_time'].dt.hour
    data['minute'] = data['trans_date_trans_time'].dt.minute
    data['second'] = data['trans_date_trans_time'].dt.second
    # Convert 'dob' to datetime
    data['dob'] = pd.to_datetime(data['dob'])
    # Calculate age in whole years from 'dob', measured against today's date
    # (so the value depends on when the notebook is run)
    current_date = datetime.now()
    data['age'] = data['dob'].apply(lambda x: relativedelta(current_date, x).years)
    return data
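
The row-wise `apply` with `relativedelta` likely contributes to the long processing time on a frame of this size, and measuring age against `datetime.now()` makes the feature drift with the run date. One possible alternative, sketched here under the assumption that age at transaction time is the quantity of interest, uses vectorized column arithmetic. The helper name `age_at_transaction` and the two toy rows are hypothetical, and this changes the meaning of `age` compared with the function above.

```python
import pandas as pd

def age_at_transaction(trans_time: pd.Series, dob: pd.Series) -> pd.Series:
    """Whole years between date of birth and transaction timestamp (vectorized)."""
    years = trans_time.dt.year - dob.dt.year
    # Subtract one year when the birthday has not yet occurred in the transaction year
    before_birthday = (trans_time.dt.month < dob.dt.month) | (
        (trans_time.dt.month == dob.dt.month) & (trans_time.dt.day < dob.dt.day)
    )
    return years - before_birthday.astype(int)

# Toy rows (illustrative only), using the same column names as the dataset
demo = pd.DataFrame({
    "trans_date_trans_time": pd.to_datetime(["2019-01-01 00:00:18", "2020-06-21 12:14:25"]),
    "dob": pd.to_datetime(["1988-03-09", "1962-01-19"]),
})
demo["age"] = age_at_transaction(demo["trans_date_trans_time"], demo["dob"])
print(demo["age"].tolist())
```

Because the result depends only on columns in the data, rerunning the notebook on a later date produces the same feature values.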

In [176]:
%%time
data = process_transaction_data(data)
datatest = process_transaction_data(datatest)

Wall time: 1min 13s


## train_test_split

In [177]:
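ohe = OneHotEncoder()
feature_cols = ["amt", "cc_num", "hour", "category", "month", "city_pop",
                "age", "gender", "zip", "state", "job"]
# Select the feature columns as a copy; dropping a column in place on a slice of `data`
# triggers pandas' SettingWithCopyWarning
X = data[feature_cols].copy()
y = data["is_fraud"]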

#X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

In [230]:
X_train=data[["amt" ,"cc_num","hour","category","month","city_pop",'age','gender',"zip","state","job"]]
y_train=data["is_fraud"]

# Keep only the test rows whose categorical values also occur in the training data
datatest = datatest[
    datatest['category'].isin(data['category']) &
    datatest['job'].isin(data['job']) &
    datatest['state'].isin(data['state']) &
    datatest['gender'].isin(data['gender'])
]


y_test=datatest["is_fraud"]
X_test=datatest[["amt" ,"cc_num","hour","category","month","city_pop",'age','gender',"zip","state","job"]]

In [231]:
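# Collect the categorical values from both splits so the encoder's output columns are fixed up front.
# This reads category values (not labels) from the test set; fitting on the training data alone
# with handle_unknown='ignore' would avoid that dependency.
merged_data = pd.concat([data[['category', 'state', 'gender', 'job']], datatest[['category', 'state', 'gender', 'job']]])

# handle_unknown='ignore' applies only to this helper encoder, not to the encoder inside column_trans
ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(merged_data)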


column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), ['category', 'state',"gender","job"]),
    (StandardScaler(), ["amt",'cc_num', 'hour',"month","city_pop","zip","age"]),
    remainder='passthrough'
)
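
An alternative to filtering test rows or merging both splits is to fit the encoder on the training data only and let it ignore categories it has not seen. The sketch below is a minimal illustration on toy frames (`train` and `test` are hypothetical, not the notebook's data); `handle_unknown="ignore"` is a documented `OneHotEncoder` option that encodes an unseen category as an all-zero row.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames (illustrative only): the test split contains a job unseen in training
train = pd.DataFrame({"job": ["Nurse", "Teacher", "Nurse"]})
test = pd.DataFrame({"job": ["Teacher", "Software engineer"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)  # categories are learned from the training split only

# Columns follow the sorted training categories: ['Nurse', 'Teacher']
encoded = encoder.transform(test).toarray()
print(encoded)
```

The unseen job produces a row of zeros instead of raising `ValueError: Found unknown categories`, so no test rows need to be dropped before prediction.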

In [191]:

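# Create and fit the pipeline
# (`rf` is the RandomForestClassifier created in cell In [186], in the RandomForestClassifier section)
pipe = make_pipeline(column_trans, rf)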
pipe.fit(X_train, y_train)

# Predict and evaluate the model
y_pred = pipe.predict(X_test)
print('Random forest')
print('F1 Score:', f1_score(y_test, y_pred))
print('Classification Report:\n', classification_report(y_test, y_pred))


Random forest
F1 Score: 0.7231308411214953
Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.97      0.58      0.72      2145

    accuracy                           1.00    555719
   macro avg       0.98      0.79      0.86    555719
weighted avg       1.00      1.00      1.00    555719



---
<a id="rf"></a><p style="line-height: 2; font-size: 25px; font-weight: bold; letter-spacing: 2px; text-align: center;"> RandomForestClassifier
</p>


 <p><a href="#0" style="background-color: #e7e7e7;
  color: #008CBA;
  border: none;
  padding: 5px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 16px;
  margin: 4px 2px;
  cursor: pointer;
  border-radius: 100%;">Back</a></p>
  
  


In [186]:
rf = RandomForestClassifier(n_estimators=20, max_depth=50, criterion='entropy', random_state=0)
pipe = make_pipeline(column_trans, rf)

In [187]:
%%time
pipe.fit(X_train,y_train)

Wall time: 5min 18s


In [188]:
y_pred=pipe.predict(X_test)
from sklearn.metrics import f1_score,classification_report
print('Random forest')
print(f1_score(y_test,y_pred))
print(classification_report(y_test,y_pred))


ValueError: Found unknown categories ['Operational investment banker', 'Engineer, water', 'Software engineer'] in column 3 during transform

In [12]:
y_pred=pipe.predict(X_test)
from sklearn.metrics import f1_score,classification_report
print('Random forest')
print(f1_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

Random forest
0.8069896743447181
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257844
           1       0.99      0.68      0.81      1491

    accuracy                           1.00    259335
   macro avg       0.99      0.84      0.90    259335
weighted avg       1.00      1.00      1.00    259335



In [53]:
y_pred=pipe.predict(X_test)
from sklearn.metrics import f1_score,classification_report
print('Random forest')
print(f1_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

Random forest
0.8209285187914517
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257761
           1       0.98      0.71      0.82      1574

    accuracy                           1.00    259335
   macro avg       0.99      0.85      0.91    259335
weighted avg       1.00      1.00      1.00    259335



In [47]:
y_pred=pipe.predict(X_test)
from sklearn.metrics import f1_score,classification_report
print('Random forest')
print(f1_score(y_test,y_pred))
print(classification_report(y_test,y_pred))

Random forest
0.825654257279764
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257761
           1       0.98      0.71      0.83      1574

    accuracy                           1.00    259335
   macro avg       0.99      0.86      0.91    259335
weighted avg       1.00      1.00      1.00    259335



---
<a id="knn"></a><p style="line-height: 2; font-size: 25px; font-weight: bold; letter-spacing: 2px; text-align: center;"> KNeighborsClassifier
</p>


 <p><a href="#0" style="background-color: #e7e7e7;
  color: #008CBA;
  border: none;
  padding: 5px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 16px;
  margin: 4px 2px;
  cursor: pointer;
  border-radius: 100%;">Back</a></p>
  
  


In [11]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=7)  
pipe = make_pipeline(column_trans, knn)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
from sklearn.metrics import f1_score, classification_report

print('K-Nearest Neighbors')
print('F1 Score:', f1_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

K-Nearest Neighbors
F1 Score: 0.46399226679555344
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257874
           1       0.79      0.33      0.46      1461

    accuracy                           1.00    259335
   macro avg       0.89      0.66      0.73    259335
weighted avg       1.00      1.00      0.99    259335



---
<a id="logreg"></a><p style="line-height: 2; font-size: 25px; font-weight: bold; letter-spacing: 2px; text-align: center;"> LogisticRegression
</p>


 <p><a href="#0" style="background-color: #e7e7e7;
  color: #008CBA;
  border: none;
  padding: 5px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 16px;
  margin: 4px 2px;
  cursor: pointer;
  border-radius: 100%;">Back</a></p>
  
  


In [233]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(random_state=0)

pipe = make_pipeline(column_trans, log_reg)

In [234]:
%%time
pipe.fit(X_train, y_train)

Wall time: 9.41 s


In [235]:
y_pred = pipe.predict(X_test)
from sklearn.metrics import f1_score, classification_report
print('Logistic Regression')
print('F1 Score:', f1_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

Logistic Regression
F1 Score: 0.01768421052631579
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.08      0.01      0.02      2115

    accuracy                           1.00    555689
   macro avg       0.54      0.50      0.51    555689
weighted avg       0.99      1.00      0.99    555689
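
A fraud-class recall of 0.01 means the model predicts "not fraud" for almost every row. One option worth trying, sketched here on synthetic data rather than the notebook's features, is the `class_weight="balanced"` setting of `LogisticRegression`, which reweights each class inversely to its frequency. The dataset from `make_classification` and all variable names are illustrative assumptions; whether this helps on the real data would need to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced data (illustrative only): roughly 2% positives
X_demo, y_demo = make_classification(
    n_samples=5000, n_features=8, weights=[0.98, 0.02], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

# Same model twice; only the class weighting differs
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print("recall without weighting:", recall_plain)
print("recall with class_weight='balanced':", recall_balanced)
```

Reweighting usually trades precision for recall, so the fraud-class F1 should be compared for both variants before choosing one.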



---
<a id="xgb"></a><p style="line-height: 2; font-size: 25px; font-weight: bold; letter-spacing: 2px; text-align: center;"> XGBClassifier
</p>


 <p><a href="#0" style="background-color: #e7e7e7;
  color: #008CBA;
  border: none;
  padding: 5px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 16px;
  margin: 4px 2px;
  cursor: pointer;
  border-radius: 100%;">Back</a></p>
  
  


In [9]:
pip install xgboost

Collecting xgboost
  Using cached xgboost-2.0.0-py3-none-win_amd64.whl (99.7 MB)
Installing collected packages: xgboost
Successfully installed xgboost-2.0.0
Note: you may need to restart the kernel to use updated packages.


In [192]:
from xgboost import XGBClassifier

column_trans = make_column_transformer(
    (OneHotEncoder(categories=ohe.categories_), ['category', 'state',"gender","job"]),
    (StandardScaler(), ["amt",'cc_num', 'hour',"month","city_pop","zip","age"]),
    remainder='passthrough'
)

# `criterion` is a scikit-learn tree option, not an XGBoost hyperparameter, so it is omitted here
xgb_classifier = XGBClassifier(n_estimators=20, max_depth=50, random_state=0)

pipe = make_pipeline(column_trans, xgb_classifier)
pipe.fit(X_train, y_train)

In [193]:
%%time
pipe.fit(X_train, y_train)

Wall time: 7.37 s


In [194]:
y_pred = pipe.predict(X_test)

from sklearn.metrics import f1_score, classification_report

print('XGBoost Classifier')
print('F1 Score:', f1_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

XGBoost Classifier
F1 Score: 0.8303413400758534
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.91      0.77      0.83      2145

    accuracy                           1.00    555719
   macro avg       0.95      0.88      0.91    555719
weighted avg       1.00      1.00      1.00    555719



In [145]:
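from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

xgb_classifier = XGBClassifier()

pipe = make_pipeline(column_trans, xgb_classifier)
# Candidate values for the randomized search; each key is prefixed with the pipeline step name
param_dist = {
    'xgbclassifier__n_estimators': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'xgbclassifier__max_depth': [5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'xgbclassifier__learning_rate': [0.01, 0.1, 0.2, 0.3, 0.5, 0.4, 0.6],
    # XGBoost documents subsample as lying in (0, 1], so the value 1.1 is out of range
    'xgbclassifier__subsample': [0.7, 0.8, 0.9, 1, 1.1],
    'xgbclassifier__colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    # `criterion` is not an XGBoost hyperparameter, so this entry does not change the model
    'xgbclassifier__criterion': ['gini', 'entropy'],
}

# Sample 10 candidates and score each with 5-fold cross-validated F1
random_search = RandomizedSearchCV(pipe, param_distributions=param_dist, n_iter=10, cv=5,
                                   scoring='f1', random_state=0, n_jobs=-1)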
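
The cell imports `randint` from `scipy.stats` but searches over fixed lists. As a sketch of the other style that `RandomizedSearchCV` accepts, the search space can be given as distributions, which also makes the valid range of each parameter explicit. The bounds below are illustrative assumptions, not tuned values; `ParameterSampler` is used only to preview what the search would draw, so no model is trained.

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import ParameterSampler

# Hypothetical search space; keys follow the pipeline's "xgbclassifier__" prefix
param_dist = {
    "xgbclassifier__n_estimators": randint(10, 101),       # integers in [10, 100]
    "xgbclassifier__max_depth": randint(3, 21),            # integers in [3, 20]
    "xgbclassifier__learning_rate": uniform(0.01, 0.59),   # floats in [0.01, 0.60]
    "xgbclassifier__subsample": uniform(0.7, 0.3),         # floats in [0.7, 1.0]
    "xgbclassifier__colsample_bytree": uniform(0.5, 0.5),  # floats in [0.5, 1.0]
}

# Preview five candidate settings exactly as a randomized search would sample them
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=0))
for candidate in samples:
    print(candidate)
```

Note that `uniform(loc, scale)` covers `[loc, loc + scale]` and `randint(low, high)` excludes `high`. The same dictionary can be passed as `param_distributions` to `RandomizedSearchCV`.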

In [199]:
%%time
random_search.fit(X_train, y_train)

Wall time: 6min 37s


In [200]:
print("Best Parameters: ", random_search.best_params_)
print("Best Score: ", random_search.best_score_)

Best Parameters:  {'xgbclassifier__subsample': 0.9, 'xgbclassifier__n_estimators': 40, 'xgbclassifier__max_depth': 20, 'xgbclassifier__learning_rate': 0.2, 'xgbclassifier__criterion': 'entropy', 'xgbclassifier__colsample_bytree': 1.0}
Best Score:  0.8513338779669523


In [201]:
best_params = random_search.best_params_

best_xgb_classifier = XGBClassifier(
    n_estimators=best_params['xgbclassifier__n_estimators'],
    max_depth=best_params['xgbclassifier__max_depth'],
    learning_rate=best_params['xgbclassifier__learning_rate'],
    subsample=best_params['xgbclassifier__subsample'],
    colsample_bytree=best_params['xgbclassifier__colsample_bytree'],
    criterion=best_params['xgbclassifier__criterion'],
    random_state=37
)

best_pipe = make_pipeline(column_trans, best_xgb_classifier)
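
By default `RandomizedSearchCV` refits the best configuration on the full training data (`refit=True`) and exposes it as `best_estimator_`, so the tuned pipeline can be scored without copying each parameter by hand. The sketch below shows the pattern on synthetic data, with `LogisticRegression` standing in for the XGBoost step to keep the example small; the data and names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data and a small stand-in pipeline (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
demo_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

search = RandomizedSearchCV(
    demo_pipe,
    param_distributions={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=3, cv=3, scoring="f1", random_state=0,
)
search.fit(X_demo, y_demo)

# best_estimator_ is a fitted copy of the pipeline carrying the winning parameters
tuned_pipe = search.best_estimator_
tuned_C = tuned_pipe.named_steps["logisticregression"].C
y_pred_demo = tuned_pipe.predict(X_demo)
print(search.best_params_, tuned_C)
```

The original `demo_pipe` object is left unfitted by the search, which is why predictions should come from `best_estimator_` (or from `search.predict`) rather than from the pipeline that was passed in.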


In [202]:
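# Note: this fits `pipe` (the default XGBClassifier passed to the search), not the tuned `best_pipe` built above
pipe.fit(X_train, y_train)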

In [203]:
y_pred = pipe.predict(X_test)

from sklearn.metrics import f1_score, classification_report

print('XGBoost Classifier')
print('F1 Score:', f1_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

XGBoost Classifier
F1 Score: 0.8342412451361867
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.94      0.75      0.83      2145

    accuracy                           1.00    555719
   macro avg       0.97      0.87      0.92    555719
weighted avg       1.00      1.00      1.00    555719



In [146]:
%%time
random_search.fit(X_train, y_train)


Wall time: 5min


In [147]:
print("Best Parameters: ", random_search.best_params_)
print("Best Score: ", random_search.best_score_)

Best Parameters:  {'xgbclassifier__subsample': 1, 'xgbclassifier__n_estimators': 90, 'xgbclassifier__max_depth': 60, 'xgbclassifier__learning_rate': 0.3, 'xgbclassifier__criterion': 'entropy', 'xgbclassifier__colsample_bytree': 0.7}
Best Score:  0.8919693277908392


In [83]:
%%time
random_search.fit(X_train, y_train)

Wall time: 4min 28s


In [97]:
print("Best Parameters: ", random_search.best_params_)
print("Best Score: ", random_search.best_score_)

Best Parameters:  {'xgbclassifier__subsample': 0.8, 'xgbclassifier__random_state': 37, 'xgbclassifier__n_estimators': 70, 'xgbclassifier__max_depth': 90, 'xgbclassifier__learning_rate': 0.6, 'xgbclassifier__criterion': 'entropy', 'xgbclassifier__colsample_bytree': 0.7}
Best Score:  0.8926565077811187


In [74]:
# `criterion` is a scikit-learn tree option, not an XGBoost hyperparameter, so it is omitted here
xgb_classifier = XGBClassifier(n_estimators=70, max_depth=30, random_state=0)

pipe = make_pipeline(column_trans, xgb_classifier)

In [195]:
best_params = random_search.best_params_

best_xgb_classifier = XGBClassifier(
    n_estimators=best_params['xgbclassifier__n_estimators'],
    max_depth=best_params['xgbclassifier__max_depth'],
    learning_rate=best_params['xgbclassifier__learning_rate'],
    subsample=best_params['xgbclassifier__subsample'],
    colsample_bytree=best_params['xgbclassifier__colsample_bytree'],
    criterion=best_params['xgbclassifier__criterion'],
    random_state=37
)

best_pipe = make_pipeline(column_trans, best_xgb_classifier)


In [196]:
%%time
# Note: this fits `pipe`, not the `best_pipe` built in the previous cell
pipe.fit(X_train, y_train)

Wall time: 8.33 s


In [197]:
y_pred = pipe.predict(X_test)

from sklearn.metrics import f1_score, classification_report

print('XGBoost Classifier')
print('F1 Score:', f1_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

XGBoost Classifier
F1 Score: 0.8303413400758534
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.91      0.77      0.83      2145

    accuracy                           1.00    555719
   macro avg       0.95      0.88      0.91    555719
weighted avg       1.00      1.00      1.00    555719



In [162]:
%%time
pipe.fit(X_train, y_train)

Wall time: 7.11 s


In [163]:
y_pred = pipe.predict(X_test)

from sklearn.metrics import f1_score, classification_report

print('XGBoost Classifier')
print('F1 Score:', f1_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

XGBoost Classifier
F1 Score: 0.888086642599278
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    257844
           1       0.96      0.82      0.89      1491

    accuracy                           1.00    259335
   macro avg       0.98      0.91      0.94    259335
weighted avg       1.00      1.00      1.00    259335



---
<a id="mlflow"></a><p style="line-height: 2; font-size: 25px; font-weight: bold; letter-spacing: 2px; text-align: center;"> Mlflow
</p>


 <p><a href="#0" style="background-color: #e7e7e7;
  color: #008CBA;
  border: none;
  padding: 5px;
  text-align: center;
  text-decoration: none;
  display: inline-block;
  font-size: 16px;
  margin: 4px 2px;
  cursor: pointer;
  border-radius: 100%;">Back</a></p>
  
  


In [222]:
import os
from dotenv import load_dotenv

load_dotenv()

mlflow_username = os.getenv('MLFLOW_TRACKING_USERNAME')
mlflow_password = os.getenv('MLFLOW_TRACKING_PASSWORD')

# Confirm the credentials were loaded without echoing the secret into the notebook output
print(f'MLFLOW_TRACKING_USERNAME: {mlflow_username}')
print(f'MLFLOW_TRACKING_PASSWORD set: {mlflow_password is not None}')

MLFLOW_TRACKING_USERNAME: islembenmaalem
MLFLOW_TRACKING_PASSWORD set: True


In [223]:
import os
os.environ['MLFLOW_TRACKING_USERNAME']= mlflow_username
os.environ["MLFLOW_TRACKING_PASSWORD"] = mlflow_password
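
Anything printed in a cell is saved with the notebook, so a printed token travels with every copy of the file. A small helper along these lines reports whether a variable is present without revealing its value. The function name `describe_secret` and the variable `DEMO_TOKEN` are hypothetical, and the sketch uses only the standard library.

```python
import os

def describe_secret(name: str) -> str:
    """Report whether an environment variable is set, without revealing its value."""
    value = os.getenv(name)
    if not value:
        return f"{name}: <missing>"
    return f"{name}: set ({len(value)} characters)"

# Demo with a throwaway variable (illustrative only)
os.environ["DEMO_TOKEN"] = "abc123"
os.environ.pop("DEMO_MISSING", None)

print(describe_secret("DEMO_TOKEN"))
print(describe_secret("DEMO_MISSING"))
```

Checking for a missing value first also avoids the `TypeError` that assigning `None` to `os.environ` would raise when the `.env` file lacks an entry.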

In [224]:
import mlflow

In [225]:
# Point MLflow at the remote tracking server and select the experiment
mlflow.set_tracking_uri('https://dagshub.com/islembenmaalem/mlops_project.mlflow')
mlflow.set_experiment("idsd-sd-experiment")

<Experiment: artifact_location='mlflow-artifacts:/b53b17c7e7a64968a9e141ada900d774', creation_time=1697399269558, experiment_id='0', last_update_time=1697399269558, lifecycle_stage='active', name='idsd-sd-experiment', tags={}>

In [219]:
version = "v1.0"
data_url ="../Data/train_data.csv"

In [232]:
data = pd.read_csv("../data/fraudTrain.csv")
datatest = pd.read_csv("../data/fraudTest.csv")
datatest = datatest[
    datatest['category'].isin(data['category']) &
    datatest['job'].isin(data['job']) &
    datatest['state'].isin(data['state']) &
    datatest['gender'].isin(data['gender'])
]

data = process_transaction_data(data)
datatest = process_transaction_data(datatest)


X_train=data[["amt" ,"cc_num","hour","category","month","city_pop",'age','gender',"zip","state","job"]]
y_train=data["is_fraud"]

y_test=datatest["is_fraud"]
X_test=datatest[["amt" ,"cc_num","hour","category","month","city_pop",'age','gender',"zip","state","job"]]

In [None]:
%%time
log_reg = LogisticRegression(random_state=0)

pipe = make_pipeline(column_trans, log_reg)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
from sklearn.metrics import f1_score, classification_report
print('Logistic Regression')
print('F1 Score:', f1_score(y_test, y_pred))
print('Classification Report:')
print(classification_report(y_test, y_pred))

In [228]:
X_train.columns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support as score

In [229]:
mlflow.sklearn.autolog(disable=True)

In [255]:
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.metrics import precision_recall_fscore_support
import time

# Assumes X_train, y_train, X_test, y_test, and column_trans are defined by the cells above

# Create a Logistic Regression model
log_reg = LogisticRegression(random_state=0)

# Create a pipeline
pipe = make_pipeline(column_trans, log_reg)

# Start an MLflow run with the specified run name
with mlflow.start_run(run_name='LogisticRegression'):
    # Start timing the fit operation
    start_time = time.time()

    # Train the model
    pipe.fit(X_train, y_train)

    # Calculate the duration of the fit operation
    fit_duration = time.time() - start_time

    # Log parameters for this specific run
    mlflow.log_param('train_data_size', len(X_train))
    mlflow.log_param('test_data_size', len(X_test))
    mlflow.log_param('fit_duration', fit_duration)
    mlflow.log_param('column_names', X_test.columns.tolist())
    params = log_reg.get_params()
    mlflow.log_params(params)

    # Predict on the test set
    y_pred = pipe.predict(X_test)

    # Calculate F1 Score
    f1 = f1_score(y_test, y_pred)

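    # Macro-averaged precision, recall, and F1: the unweighted mean over both classes,
    # so these differ from the fraud-class F1 above; `support` is None when an average is requested
    precision, recall, fscore, support = precision_recall_fscore_support(y_test, y_pred, average='macro')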

    # Calculate confusion matrix
    cm = confusion_matrix(y_test, y_pred)

    mlflow.set_tag(key="model", value="LogisticRegression")

    # Log the model
    mlflow.sklearn.log_model(pipe, artifact_path="ML_models")

    # Log metrics
    mlflow.log_metric('f1_score', f1)
    mlflow.log_metric("Precision_test", precision)
    mlflow.log_metric("Recall_test", recall)
    mlflow.log_metric("F1_score_test", fscore)

  

    # Generate the classification report
    report = classification_report(y_test, y_pred)

    # Log the classification report as text
    mlflow.log_text(report, "classification_report")

    print('Logistic Regression')
    print('F1 Score:', f1)
    print('Precision:', precision)
    print('Recall:', recall)
    print('F1 Score:', fscore)
    print('Confusion Matrix:')
    print(cm)
    print('Classification Report:')
    print(report)


Logistic Regression
F1 Score: 0.01768421052631579
Precision: 0.5384995859749159
Recall: 0.504748669042101
F1 Score: 0.5077902595963744
Confusion Matrix:
[[553335    239]
 [  2094     21]]
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    553574
           1       0.08      0.01      0.02      2115

    accuracy                           1.00    555689
   macro avg       0.54      0.50      0.51    555689
weighted avg       0.99      1.00      0.99    555689
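
The printout shows two different "F1 Score" values because the first is the F1 of the fraud class and the second is the macro average over both classes. Both can be reproduced from the confusion matrix above, which also shows why an accuracy that rounds to 1.00 says little here. The helper name `f1_from_counts` is illustrative; the counts are copied from the matrix.

```python
# Counts copied from the logistic regression confusion matrix: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = 553335, 239, 2094, 21

def f1_from_counts(true_pos: int, false_pos: int, false_neg: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN) for a single class."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)

f1_fraud = f1_from_counts(tp, fp, fn)   # class 1 treated as the positive class
f1_legit = f1_from_counts(tn, fn, fp)   # class 0 treated as the positive class
f1_macro = (f1_fraud + f1_legit) / 2    # unweighted mean over the two classes
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"fraud-class F1: {f1_fraud:.4f}")
print(f"macro F1:       {f1_macro:.4f}")
print(f"accuracy:       {accuracy:.4f}")
```

Only 21 of the 2,115 fraudulent test rows are caught, yet accuracy stays above 99% because the legitimate class dominates. The fraud-class F1 is therefore the more informative number for comparing the models in this notebook.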



In [256]:
# Fetch all runs as a pandas DataFrame and pick the run with the highest macro F1 on the test set
all_experiments = [exp.experiment_id for exp in mlflow.search_experiments()]
df_mlflow = mlflow.search_runs(experiment_ids=all_experiments,filter_string="metrics.F1_score_test <1")
run_id = df_mlflow.loc[df_mlflow['metrics.F1_score_test'].idxmax()]['run_id']
print(run_id)

f10f1c95a18442849dc9cf9ad23877b7
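
The runs table behaves like any other DataFrame, so the selection step can be tried offline. The sketch below uses a hand-made frame with hypothetical run ids and scores, and assumes that a run which never logged the metric shows up as `NaN` in that column; dropping such rows first keeps the choice explicit.

```python
import pandas as pd

# Hand-made stand-in for the runs table (hypothetical run ids and scores)
runs = pd.DataFrame({
    "run_id": ["run_a", "run_b", "run_c"],
    "metrics.F1_score_test": [0.51, 0.92, None],  # run_c never logged the metric
})

# Ignore runs without the metric, then take the row with the highest score
scored = runs.dropna(subset=["metrics.F1_score_test"])
best_run_id = scored.loc[scored["metrics.F1_score_test"].idxmax(), "run_id"]
print(best_run_id)
```

`mlflow.search_runs` also accepts an `order_by` argument, which is another way to get the top run without sorting in pandas.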


In [257]:
df_mlflow

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.F1_score_test,metrics.Precision_test,metrics.Recall_test,metrics.f1_score,params.tol,params.multi_class,params.train_data_size,params.intercept_scaling,params.column_names,params.random_state,params.verbose,params.class_weight,params.penalty,params.test_data_size,params.warm_start,params.n_jobs,params.l1_ratio,params.solver,params.dual,params.fit_duration,params.fit_intercept,params.max_iter,params.C,tags.model,tags.mlflow.source.type,tags.mlflow.runName,tags.mlflow.source.name,tags.mlflow.user,tags.mlflow.log-model.history
0,f10f1c95a18442849dc9cf9ad23877b7,0,FINISHED,mlflow-artifacts:/b53b17c7e7a64968a9e141ada900...,2023-10-15 20:49:30.642000+00:00,2023-10-15 20:50:26.329000+00:00,0.50779,0.5385,0.504749,0.017684,0.0001,auto,1296675,1,"['amt', 'cc_num', 'hour', 'category', 'month',...",0,0,,l2,555689,False,,,lbfgs,False,9.655415058135986,True,100,1.0,LogisticRegression,LOCAL,LogisticRegression,C:\Users\MSI\anaconda3\envs\mlops\lib\site-pac...,islembenmaalem,"[{""run_id"": ""f10f1c95a18442849dc9cf9ad23877b7""..."


In [258]:
# Load the selected run's logged model from its run artifacts (a runs:/ URI, not the model registry)
import mlflow.pyfunc

logged_model = f'runs:/{run_id}/ML_models'

# Load model as a PyFuncModel.
loaded_model = mlflow.pyfunc.load_model(logged_model)
print(loaded_model)

# Predict on a Pandas DataFrame.

loaded_model.predict(X_test)

mlflow.pyfunc.loaded_model:
  artifact_path: ML_models
  flavor: mlflow.sklearn
  run_id: f10f1c95a18442849dc9cf9ad23877b7



array([0, 0, 0, ..., 0, 0, 0], dtype=int64)