Imbalanced data refers to those types of datasets where the target class has an uneven distribution of observations, i.e one class label has a very high number of observations and the other has a very low number of observations.

Credit card fraud happens when consumers give their credit card number to unfamiliar individuals, when cards are lost or stolen, when mail is diverted from the intended recipient and taken by criminals, or when employees of a business copy the cards or card numbers of a cardholder. In this notebook we will develop a few ML models using anonymized credit card transaction data. The challenge behind fraud detection is that frauds are far less common as compared to legal transactions

IMPORTING LIBRARIES AND DEPENDANCIES

In [70]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler,MinMaxScaler,RobustScaler,LabelEncoder,OneHotEncoder
from sklearn.model_selection import train_test_split,KFold,RandomizedSearchCV,GridSearchCV,cross_val_score
from sklearn.tree import DecisionTreeClassifier,plot_tree
from sklearn.ensemble import RandomForestClassifier,BaggingClassifier,AdaBoostClassifier,GradientBoostingClassifier
import xgboost as xgb

In [None]:
pip install catboost

Collecting catboost
  Downloading catboost-1.2-cp310-cp310-manylinux2014_x86_64.whl (98.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.6/98.6 MB[0m [31m9.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.2


In [None]:
pip install lightgbm



In [None]:
import catboost as cb
import lightgbm as lgb

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


IMPORTING THE FILE

In [71]:
df=pd.read_csv('creditcard.csv',index_col=0)

PERFORMING EDA AND DATA ANALYSIS

In [72]:
df.shape

(277215, 30)

In [None]:
df.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class'],
      dtype='object')

In [None]:
df.dtypes

In [None]:
df.isnull().sum()

In [None]:
df.corr()

In [None]:
sns.pairplot(df)

In [None]:
def plotCorrelationMatrix(df, figsize):
    # Compute the correlation matrix
    corr = df.corr()

    # Set up the matplotlib figure
    plt.figure(figsize=(figsize, figsize))

    # Create a heatmap using Seaborn
    sns.heatmap(corr, annot=True, cmap='coolwarm', center=0, linewidths=0.5, square=True)

    # Add a title to the plot
    plt.title('Correlation Matrix')

    # Show the plot
    plt.show()

# Assuming you have a DataFrame called 'df'
# If not, replace 'df' with the name of your DataFrame
plotCorrelationMatrix(df,30)

In [None]:
df.describe()

In [None]:
df['Class'].value_counts().plot(kind='bar')#data is imbalance

In [None]:
class_counts=df['Class'].value_counts()
#percentage of fraud and non fraud
percentage=(class_counts/len(df))*100

In [None]:
percentage

0.0    99.817156
1.0     0.182420
Name: Class, dtype: float64

In [None]:
percentage.plot(kind='bar')

As there are many columns so lets check which columns are more important to us through VIF-Variance inflation factor

The variance of columns (features) in a dataset is important in feature selection for several reasons:

Identifying Low-Variance Features: Features with very low variance may not contain much useful information for modeling. They might carry little or no variability across the samples, making them less informative and less likely to contribute to the prediction process. By removing low-variance features, we can simplify the model and potentially improve its performance.

Avoiding Overfitting: High-variance features can potentially lead to overfitting, where the model learns the noise or specific patterns in the training data that do not generalize well to new data. By removing such features, we can reduce the complexity of the model and prevent overfitting.

Reducing Computational Complexity: Removing low-variance features can reduce the computational resources required to train the model, as the model has fewer parameters to estimate.

Multicollinearity: In regression models, high multicollinearity among predictor variables can lead to unstable parameter estimates and reduce the interpretability of the model. High-variance features can be an indicator of potential multicollinearity.

Focus on Relevant Features: In some cases, we might be interested in identifying the most relevant features that have significant variability across the dataset. By selecting features with higher variance, we can focus on the most informative aspects of the data.

Feature Importance: In ensemble methods like Random Forest and Gradient Boosting, the importance of features is often based on their contribution to reducing impurity (e.g., Gini impurity or entropy). High-variance features might have higher importance scores in such models.

However, it's important to note that the importance of variance as a feature selection criterion depends on the problem and the type of data. For example, in some cases, low-variance features might be crucial for specific tasks, such as identifying constant or near-constant features that carry specific information. Therefore, it's essential to consider various feature selection techniques and domain knowledge to make informed decisions about feature inclusion or exclusion in a model.

In [None]:
# Calculate the variance for each column (excluding 'Amount')
variance_data = np.var(df.drop('Amount', axis=1))

# Create a bar plot of the variances
variance_data.plot(kind='bar')

# Add labels and title
plt.xlabel('Columns')
plt.ylabel('Variance')
plt.title('Variance of Each Column')

# Show the plot
plt.show()

In [None]:
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectKBest, mutual_info_classif,RFE,chi2
threashold=0.2
var_thres=VarianceThreshold(threashold)
var_thres.fit(df.drop('Class',axis=1))

In [None]:
var_thres.get_support()

DATA PRE PROCESSING OR DATA CLEANING

In [74]:
df=df.dropna()

In [None]:
# Check for duplicate rows in the DataFrame
duplicate_rows = df[df.duplicated()]

# Print the duplicate rows (if any)
if not duplicate_rows.empty:
    print("Duplicate Rows:")
    print(duplicate_rows)
else:
    print("No Duplicate Rows Found.")

In [75]:
df.drop_duplicates(inplace=True)

In [76]:
df.shape

(268446, 30)

In [None]:
df.eq(0).sum()

REPLACING 0 VALUES OF AMOUNT COLUMN WITH MEAN VALUE

In [77]:
df['Amount'] = df['Amount'].replace(0,df['Amount'].mean())

In [None]:
df.eq(0).sum()

In [None]:
df.describe()

PLOTTING A SUB PLOT TO KNOW THE DISTRIBUTION OF THE DATA

In [None]:
# Get the list of column names
column_names = df.columns

# Set the number of rows and columns for subplots
num_rows = len(column_names)
num_cols = 1

# Create subplots
fig, axs = plt.subplots(num_rows, num_cols, figsize=(8, num_rows*4))

# Iterate over each column
for i, column in enumerate(column_names):
    # Select the appropriate subplot
    ax = axs[i] if num_rows > 1 else axs

    # Create a histplot for the column
    sns.histplot(data=df, x=column, ax=ax)

    # Set labels and title for each subplot
    ax.set_xlabel('X-axis Label')
    ax.set_ylabel('Y-axis Label')
    ax.set_title(column)

# Adjust the spacing between subplots
plt.tight_layout()

# Display the plot
plt.show()

TREATING THE OUT LIERS
Certainly! The Z-score method is a common technique used to identify and treat outliers in your dataset. It involves calculating the Z-score for each data point and then determining whether it falls above or below a certain threshold. Here's how you can use the Z-score method to treat outliers in your DataFrame:

In [None]:
def remove_outliers_zscore(df, threshold=3):
    z_scores = np.abs((df - np.mean(df)) / np.std(df))
    outliers_mask = z_scores > threshold
    df = df[~outliers_mask]
    return df


In [None]:
remove_outliers_zscore(df,threshold=3)

In [None]:
df.shape

(275663, 30)

In [None]:
df.isnull().sum()

DEFINING X AND Y VARIABLE

In [78]:
x=df.drop('Class',axis=1)

In [79]:
x=x.dropna()

In [80]:
y=df['Class']

In [81]:
y=y.dropna()

In [None]:
x.columns

Index(['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount'],
      dtype='object')

SPLITTING THE DATA INTO TRAIN AND TEST

In [None]:
x_test,x_train,y_test,y_train=train_test_split(x,y,train_size=0.20,random_state=42)

In [None]:
print(x_test.shape,x_train.shape,y_test.shape,y_train.shape)

(45756, 29) (183028, 29) (45756,) (183028,)


SCALING THE DATA

In [None]:
scaler=RobustScaler()
x_train[['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']]=scaler.fit_transform(x_train[['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']])

In [None]:
x_test[['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']]=scaler.fit_transform(x_test[['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11',
       'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21',
       'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount']])

In [None]:
x_train.describe()

MODEL-1 LOGISTIC REGRESSION FITTING WITHOUT ANY HYPER PARAMETER

In [None]:
model1=LogisticRegression()

In [None]:
model1.fit(x_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [None]:
ypts=model1.predict(x_test)#y predict for test data

In [None]:
yptr=model1.predict(x_train)#y predict for train data

APPLYING CROSS VALIDATION TECHNIQUE

In [None]:
cv_score=cross_val_score(model1,x_train,y_train,cv=10,scoring='f1_weighted')
print(cv_score)
print(cv_score.mean())

In [None]:
from sklearn.metrics import accuracy_score,recall_score,confusion_matrix,f1_score

In [None]:
print("Test Accuracy",accuracy_score(y_test,ypts))

Test Accuracy 0.9990602325378093


In [None]:
print("Train Accuracy",accuracy_score(y_train,yptr))

Train Accuracy 0.9990984985903796


In [None]:
print("Train Accuracy",f1_score(y_train,yptr))

Train Accuracy 0.7005444646098004


In [None]:
print("Test Accuracy",f1_score(y_test,ypts))

Test Accuracy 0.7114093959731543


LETS CHECK WITH PRINCIPAL COMPONENT ANALYSIS

In [None]:
from sklearn.decomposition import PCA

In [None]:
pca=PCA(n_components=3)

In [None]:
x_pca=pca.fit_transform(x)

In [None]:
x_pca=pd.DataFrame(x_pca)

In [None]:
x_pca.head()

Unnamed: 0,0,1,2
0,55.770358,1.140037,0.480344
1,-91.162753,-1.090921,-0.158182
2,284.815596,0.641822,0.678241
3,29.650608,0.709155,0.55805
4,-23.861352,1.119146,-0.394007


In [None]:
x_pca_scaled=scaler.fit_transform(x_pca)

In [None]:
x_pca_scaled=pd.DataFrame(x_pca_scaled)

In [None]:
x_pca_scaled.head(5)

Unnamed: 0,0,1,2
0,1.491032,0.311512,0.743311
1,-0.309163,-0.032254,-0.533033
2,4.297239,-0.157576,0.503914
3,1.171019,-0.079667,0.522101
4,0.515406,0.697994,0.649302


In [None]:
x_pca_scaled_test,x_pca_scaled_train,y_test,y_train=train_test_split(x_pca_scaled,y,train_size=0.20,random_state=42)

In [None]:
x_pca_scaled_train.shape

(183028, 3)

In [None]:
model1.fit(x_pca_scaled_train,y_train)

In [None]:
ypts2=model1.predict(x_pca_scaled_test)#y predict for test data

In [None]:
yptr2=model1.predict(x_pca_scaled_train)#y predict for training data

In [None]:
print("Train Accuracy with applying Principal component analysis",f1_score(y_train,yptr2))

Train Accuracy with applying Principal component analysis 0.31295843520782396


In [None]:
print("Test Accuracy with Applying Principal component analysis",f1_score(y_test,ypts2))

Test Accuracy with Applying Principal component analysis 0.39669421487603307



Here's what's happening in this case:

Low Training Accuracy: The model is not performing well on the training data itself, which suggests that it's struggling to capture the underlying patterns and relationships present in the data. It's potentially too complex or flexible.

High Testing Accuracy: Despite the low training accuracy, the model performs well on the testing data. This might happen if the noise and fluctuations in the training data are specific to that dataset and do not generalize to other data points.

Common reasons for this scenario include:

Complex Model: The model you're using might be too complex and able to memorize the training data. Decision trees, for instance, can easily overfit if they are allowed to grow too deep.

Insufficient Training Data: When you have a small amount of training data, it becomes easier for a complex model to memorize it, leading to overfitting. The model won't generalize well to new data.

Noisy Data: If your training data has a lot of noise or outliers, the model might fit those noise points as well. These noise points won't appear in the test data, leading to better generalization.

Feature Engineering: It's possible that you have engineered features that capture noise in the data, causing the model to overfit.

To address this issue and improve your model's performance:

Use Simpler Models: Consider using simpler algorithms or models with fewer parameters, which are less prone to overfitting.

Regularization: Apply regularization techniques like L1 or L2 regularization to penalize overly complex models.

Cross-Validation: Use techniques like k-fold cross-validation to evaluate your model's performance on multiple subsets of the data.

More Data: If possible, collect more data to provide the model with a broader range of examples to learn from.

Feature Selection/Engineering: Review and adjust your feature engineering process to ensure that the features you create are truly informative.

Pruning: If you're using a Decision Tree, consider applying pruning techniques to limit the tree's depth and complexity.

Remember, the goal is to strike a balance between the model's ability to fit the training data and its ability to generalize to new, unseen data. A high testing accuracy with a low training accuracy is a clear sign that this balance has not been achieved, and measures need to be taken to address overfitting.

APPLYING SAMPLING TECHNIQUES

Most machine learning algorithms work best when the number of samples in each class is about equal. This is because most algorithms are designed to maximize accuracy and reduce errors.

However, if the dataframe has imbalanced classes, then In such cases, you get a pretty high accuracy just by predicting the majority class, but you fail to capture the minority class, which is most often the point of creating the model in the first place. For example, if the class distribution shows that 99% of the data has the majority class, then any basic classification model like the logistic regression or decision tree will not be able to identify the minor class data points.

In [82]:
from imblearn.under_sampling import RandomUnderSampler

In [83]:
rus = RandomUnderSampler(random_state=42, replacement=True)# fit predictor and target variable
x_rus, y_rus = rus.fit_resample(x, y)

In [None]:
x_rus.shape

(822, 29)

In [None]:
y_rus.shape

(822,)

In [84]:
x_rus=scaler.fit_transform(x_rus)

In [None]:
model2=DecisionTreeClassifier()
# Perform 5-fold cross-validation and get accuracy scores
cv_scores = cross_val_score(model2, x_rus, y_rus, cv=20, scoring='f1_weighted')

# Print the accuracy scores for each fold
print("Cross-validation accuracy scores:", cv_scores)

# Calculate the average accuracy score over all folds
avg_accuracy = np.mean(cv_scores)
print("Average accuracy:", avg_accuracy)

Cross-validation accuracy scores: [0.92853091 0.78461538 0.97560976 0.95116144 0.90232288 0.8765839
 0.95116144 0.8498924  0.90208893 0.87804878 0.82906468 0.73074741
 0.82803779 0.90232288 0.92682927 0.926742   0.90126075 0.90232288
 0.90173883 0.87716985]
Average accuracy: 0.8863126083534247


APPLYING GRID SEARCH CV

In [None]:
x_rus_test,x_rus_train,y_rus_test,y_rus_train=train_test_split(x_rus,y_rus,train_size=0.20,random_state=42)

In [None]:
x_rus_test.shape

(164, 29)

In [None]:
x_rus_train.shape

(658, 29)

In [None]:
y_rus_train.shape

(658,)

In [None]:
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None,range(1,20)],
    'min_samples_split': range(1,20),
    'min_samples_leaf': range(1,20)}
grid_search = GridSearchCV(model2, param_grid, cv=5, scoring='f1_weighted')

# Perform hyperparameter tuning using Grid Search
grid_search.fit(x_rus_train, y_rus_train)

# Get the best hyperparameters and the corresponding best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict the target values for the testing set using the best model
ypts3 = best_model.predict(x_rus_test)
yptr3 = best_model.predict(x_rus_train)

# Calculate the accuracy score for the testing set
testing_accuracy = accuracy_score(y_rus_test, ypts3)
training_accuracy = accuracy_score(y_rus_train, yptr3)

# Print the best hyperparameters and the accuracy score for the testing set
print("Best Hyperparameters:", best_params)
print("Testing Accuracy:", testing_accuracy)
print("Training Accuracy:", training_accuracy)

Best Hyperparameters: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 19, 'min_samples_split': 3}
Testing Accuracy: 0.9146341463414634
Training Accuracy: 0.9422492401215805


3800 fits failed out of a total of 7220.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
190 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/tree/_classes.py", line 889, in fit
    super().fit(
  File "/usr/local/lib/python3.10/dist-packages/sklearn/tree/_classes.py", line 177, in fit
    self._validate_params()
  File "/usr/local/lib/python3.10/dist-packages/sklearn/base.py", line 600, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.10/dist-packages/sklea

AGAIN FITTING LOGISTIC REGRESSION ON SAMPLED DATA

In [None]:
model1.fit(x_rus_train,y_rus_train)

In [None]:
yptr3=model1.predict(x_rus_train)

In [None]:
ypts3=model1.predict(x_rus_test)

In [None]:
print("Training Accuracy",f1_score(y_rus_test,ypts3))

Training Accuracy 0.9112426035502958


In [None]:
print("Training Accuracy",f1_score(y_rus_train,yptr3))

Training Accuracy 0.9528301886792453


APPLYING SOME ENSAMBLE TECHNIQUES

TECHNIQUE-1 RANDOM FOREST CLASSIFIER WITH TRAIN TEST SPLIT

In [None]:
from sklearn.ensemble import BaggingClassifier,RandomForestClassifier

In [None]:
model4=RandomForestClassifier(n_estimators=100,max_depth=5,max_features=5,oob_score=True,bootstrap=True,criterion='gini',random_state=42)

In [None]:
model4.fit(x_rus_train,y_rus_train)

In [None]:
yptr4=model4.predict(x_rus_train)

In [None]:
ypts4=model4.predict(x_rus_test)

In [None]:
f1_score(y_rus_train,yptr4)

0.9620253164556963

In [None]:
f1_score(y_rus_test,ypts4)

0.9382716049382716

APPLYING RANDOM FOREST CLASSIFIER WITH CROSS VALIDATION

In [None]:
kfold=KFold(n_splits=10)
results = cross_val_score(model4, x_rus, y_rus,cv=kfold,scoring='f1_weighted')
print(results.mean())

0.9595659114491418


In [None]:
kfold=KFold(n_splits=10)
results = cross_val_score(model4, x_rus, y_rus,cv=10,scoring='f1_weighted')
print(results.mean())

0.9301537220375664


APPLYING RANDOM FOREST CLASSIFIER WITH GRID SEARCH CV

In [None]:
# Define the parameter grid you want to search
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [None, 10, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}
# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=model4, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit the grid search to the training data
grid_search.fit(x_rus_train,y_rus_train)

# Get the best parameters and the best model
best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

# Predict using the best model
y_pred = best_model.predict(x_rus_test)

# Print classification report
print("Best Parameters:", best_params)
print("Classification Report:\n", classification_report(y_test, y_pred))

Best Parameters: {'max_depth': 10, 'max_features': 'log2', 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}


NameError: ignored

Boosting is an ensemble modeling technique that was first presented by Freund and Schapire in the year 1997. Since then, Boosting has been a prevalent technique for tackling binary classification problems. These algorithms improve the prediction power by converting a number of weak learners to strong learners.

The principle behind boosting algorithms is that we first build a model on the training dataset and then build a second model to rectify the errors present in the first model. This procedure is continued until and unless the errors are minimized and the dataset is predicted correctly. Boosting algorithms work in a similar way, it combines multiple models (weak learners) to reach the final output (strong learners).

In [85]:
model4=AdaBoostClassifier(n_estimators=50,random_state=42)

In [86]:
model4.fit(x_rus_train,y_rus_train)

In [87]:
ypts=model4.predict(x_rus_test)

In [88]:
yptr=model4.predict(x_rus_train)

In [90]:
from sklearn.metrics import f1_score

In [91]:
test_accu=f1_score(y_rus_test,ypts)

In [92]:
test_accu

0.9357798165137615

APPLYING GRADIENT BOOSTING

In [100]:
model5=GradientBoostingClassifier(random_state=42,n_estimators=50)

In [102]:
model5.fit(x_rus_train,y_rus_train)

In [103]:
ypts=model5.predict(x_rus_test)

In [104]:
test_accu=f1_score(y_rus_test,ypts)

In [105]:
test_accu

0.9259259259259259