# Logistic Regression for Anti-Money Laundering (AML)

## Handling Duplicates and Assessing Degrees of Freedom




In this notebook, we created a synthetic dataset to test fraud detection algorithms for Anti-Money Laundering (AML). We highlighted the importance of including duplicates in the dataset, as they help the model learn to identify potential fraud.

We performed data cleaning and preprocessing to prepare the dataset for analysis. Then, we built a logistic regression model to classify transactions as legitimate (`IsDuplicate = 0`) or potentially fraudulent (`IsDuplicate = 1`) using the features `UserID` and `Amount`.

We also discussed the role of degrees of freedom in understanding model complexity and evaluating performance.


## Part 1: Dataset Creation

The code below generates a synthetic dataset that simulates real-world scenarios for testing fraud detection algorithms, particularly for identifying duplicate financial transactions and salary or claim records.

The dataset consists of a total of 1,300 records, structured as follows:
- **Number of Unique Transaction Records**: 1,000
- **Number of Duplicate Records**: 300 (introduced to simulate fraud or entry errors)

Duplicates are introduced by randomly selecting existing records and creating new entries with slight variations:
- Each duplicate is assigned a new **TransactionID**.
- The **TransactionDate** is slightly modified by adding a random number of minutes (from 1 to 59, since `np.random.randint(1, 60)` excludes the upper bound).

After generating the records, the dataset is shuffled to randomize the order of entries.

A new column, **IsDuplicate**, is added to indicate whether a record is a duplicate based on the combination of **UserID** and **Amount**. This column is set to 1 for duplicates and 0 for unique records.



In [34]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Set random seed for reproducibility
np.random.seed(42)

# Parameters
num_records = 1000 # Number of unique records
num_duplicates = 300  # Number of duplicates to introduce

# Generate unique records
user_ids = np.random.randint(1, 100, size=num_records)
amounts = np.random.uniform(100, 10000, size=num_records)
transaction_dates = [datetime.now() - timedelta(days=np.random.randint(0, 30)) for _ in range(num_records)]
transaction_types = np.random.choice(['salary', 'claim', 'purchase'], size=num_records)
locations = np.random.choice(['Location_A', 'Location_B', 'Location_C'], size=num_records)

# Create DataFrame
data = pd.DataFrame({
    'TransactionID': range(1, num_records + 1),
    'UserID': user_ids,
    'Amount': amounts,
    'TransactionDate': transaction_dates,
    'TransactionType': transaction_types,
    'Location': locations,
    'Status': 'completed'
})

# Introduce duplicates
for i in range(num_duplicates):
    # Randomly select a record to duplicate
    duplicate_record = data.sample(1).copy()
    duplicate_record['TransactionID'] = num_records + i + 1  # New ID for duplicate
    # Ensure the duplicate has the same UserID and Amount
    duplicate_record['TransactionDate'] += timedelta(minutes=np.random.randint(1, 60))  # Slightly alter date
    data = pd.concat([data, duplicate_record], ignore_index=True)

# Shuffle the dataset
data = data.sample(frac=1).reset_index(drop=True)

# Labeling duplicates based on UserID and Amount (this is appropriate for the use case)
data['IsDuplicate'] = data.duplicated(subset=['UserID', 'Amount'], keep=False).astype(int)

# Save to CSV
data.to_csv('synthetic_transactions.csv', index=False)

print("Synthetic dataset created and saved to 'synthetic_transactions.csv'.")


Synthetic dataset created and saved to 'synthetic_transactions.csv'.


In [28]:
data.head(2)

   TransactionID  UserID       Amount            TransactionDate TransactionType    Location     Status  IsDuplicate
0            262      90  1230.955467 2025-06-01 00:57:02.661825          salary  Location_B  completed            1
1           1142      35  1111.905376 2025-05-29 01:52:02.660998        purchase  Location_C  completed            1


In [39]:
len(data)

1300

In [45]:
# Get the number of columns in the dataset
num_columns = len(data.columns)
print(num_columns)


8


In [36]:
# Summary statistics for numerical features
print(data.describe())

       TransactionID       UserID       Amount                TransactionDate  \
count    1300.000000  1300.000000  1300.000000                           1300   
mean      650.500000    49.546154  5120.406928  2025-05-19 06:46:24.638590720   
min         1.000000     1.000000   102.351483     2025-05-05 01:50:17.990577   
25%       325.750000    24.000000  2641.001010  2025-05-12 01:50:17.991203328   
50%       650.500000    49.500000  5219.979966  2025-05-19 01:50:17.992592384   
75%       975.250000    74.250000  7591.742530  2025-05-27 01:50:17.993413632   
max      1300.000000    99.000000  9993.599646     2025-06-03 03:36:17.993745   
std       375.421985    29.112960  2847.132820                            NaN   

       IsDuplicate  
count  1300.000000  
mean      0.403077  
min       0.000000  
25%       0.000000  
50%       0.000000  
75%       1.000000  
max       1.000000  
std       0.490705  


### Duplicates:

Including duplicates in the dataset is essential for training models aimed at identifying fraudulent transactions. The presence of duplicates allows the model to learn the characteristics of transactions that are likely to be flagged as duplicates or fraudulent.

The duplicates created in this dataset are not exact matches across all columns, as the `TransactionDate` has been modified. However, we label duplicates based on `UserID` and `Amount`, which is appropriate for this use case. By clearly labeling duplicates, we provide the model with explicit examples of what constitutes a duplicate transaction, which is crucial for effective training.
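
To make the labeling rule concrete, here is a minimal sketch on a toy DataFrame (the values are made up for illustration and are not taken from the dataset). It shows how `keep=False` flags every row in a repeated `(UserID, Amount)` group, including the original record, whereas the default `keep='first'` flags only the later copies. This is why the dataset contains more rows labeled `1` than the 300 injected copies.

```python
import pandas as pd

# Toy data: rows 0 and 3 share the same (UserID, Amount) pair.
toy = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "UserID": [7, 7, 8, 7],
    "Amount": [250.0, 99.5, 250.0, 250.0],
})

# keep=False marks every member of a repeated group, original included.
flag_all = toy.duplicated(subset=["UserID", "Amount"], keep=False).astype(int)

# Default keep='first' marks only the second and later occurrences.
flag_later = toy.duplicated(subset=["UserID", "Amount"]).astype(int)

print(flag_all.tolist())    # [1, 0, 0, 1]
print(flag_later.tolist())  # [0, 0, 0, 1]
```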




## Generic Data Cleaning

The data cleaning code below addresses several key aspects, including handling missing values, removing any unintended duplicates, and ensuring that the data types are appropriate for analysis:


In [37]:
import pandas as pd

# Load the dataset
data = pd.read_csv('synthetic_transactions.csv')

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Handle missing values (if any):
# fill missing 'Amount' with the column mean, and drop rows missing 'UserID' or 'TransactionDate'.
# Assigning the result back (instead of calling fillna with inplace=True on the
# column selection) avoids the pandas chained-assignment FutureWarning.
data['Amount'] = data['Amount'].fillna(data['Amount'].mean())
data.dropna(subset=['UserID', 'TransactionDate'], inplace=True)

# Check for duplicate records
print("\nDuplicate Records Before Removal:")
print(data.duplicated().sum())

# Remove Duplicates using drop_duplicates method
data.drop_duplicates(inplace=True)


# Convert 'TransactionDate' to datetime
data['TransactionDate'] = pd.to_datetime(data['TransactionDate'])

# Check data types
print("\nData Types After Cleaning:")
print(data.dtypes)

# Reset index after cleaning
data.reset_index(drop=True, inplace=True)

# Display the cleaned dataset
print("\nCleaned Data:")
print(data.head(3))

# Save the cleaned dataset to a new CSV file
data.to_csv('cleaned_synthetic_transactions.csv', index=False)
print("\nCleaned dataset saved to 'cleaned_synthetic_transactions.csv'.")



Missing Values:
TransactionID      0
UserID             0
Amount             0
TransactionDate    0
TransactionType    0
Location           0
Status             0
IsDuplicate        0
dtype: int64

Duplicate Records Before Removal:
0

Data Types After Cleaning:
TransactionID               int64
UserID                      int64
Amount                    float64
TransactionDate    datetime64[ns]
TransactionType            object
Location                   object
Status                     object
IsDuplicate                 int64
dtype: object

Cleaned Data:
   TransactionID  UserID       Amount            TransactionDate  \
0            262      90  1230.955467 2025-06-01 01:50:17.991587   
1           1142      35  1111.905376 2025-05-29 02:45:17.991105   
2            463      55   539.328969 2025-05-12 01:50:17.992295   

  TransactionType    Location     Status  IsDuplicate  
0          salary  Location_B  completed            1  
1        purchase  Location_C  completed           

**Note on `fillna`:** pandas raises a `FutureWarning` when `fillna` is called with `inplace=True` on a column selection, as in `data['Amount'].fillna(data['Amount'].mean(), inplace=True)`. The selection behaves as a copy, and from pandas 3.0 the call will no longer modify `data`. Assigning the result back, `data['Amount'] = data['Amount'].fillna(data['Amount'].mean())`, avoids the problem.



## Data Cleaning in the Context of Building a Logistic Regression Model for AML Detection

### Understanding Duplicates in AML Modeling

**Purpose of Duplicates**:
- Duplicates in the dataset can signify potential fraud or errors in transaction entries. In the context of Anti-Money Laundering (AML), these duplicates often represent the very cases that need to be identified and flagged.
- Removing duplicates may inadvertently eliminate important examples of fraudulent behavior that the model needs to learn from.

**Using Raw Data with Duplicates**:
- When training your model, it is generally advisable to include the raw data with duplicates. This allows the model to learn the patterns associated with both legitimate and potentially fraudulent transactions.
- The presence of duplicates can help the model understand the characteristics of transactions that are likely to be flagged as duplicates or fraudulent.

**Labeling Duplicates**:
- Instead of removing duplicates, we can label them appropriately. For instance, we created a binary target variable (e.g., `IsDuplicate`) that indicates whether a transaction is a duplicate.
- This approach enables the model to differentiate between legitimate transactions and those that are likely to be fraudulent or erroneous.

### Recommended Approach:

**Keep Duplicates**: Use the dataset with duplicates included for training your Logistic Regression model. This will provide the model with a more comprehensive view of the data.

**Labeling**: Ensure that you have a clear labeling strategy for your target variable. For example, label duplicates as `1` (fraudulent) and unique transactions as `0` (legitimate).

## Part 2: Updated Data Cleaning

While it is important to keep duplicates for training, we may still want to perform other cleaning steps, such as handling missing values, correcting data types, and normalizing or scaling numerical features:



In [41]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta

# Load the dataset
data = pd.read_csv('synthetic_transactions.csv')

# Display the first few rows of the dataset
print("Initial Data:")
print(data.head(2))

# Check for missing values
print("\nMissing Values:")
print(data.isnull().sum())

# Handle missing values (if any):
# fill missing 'Amount' with the column mean, and drop rows missing 'UserID' or 'TransactionDate'.
# Assigning the result back (instead of calling fillna with inplace=True on the
# column selection) avoids the pandas chained-assignment FutureWarning.
data['Amount'] = data['Amount'].fillna(data['Amount'].mean())
data.dropna(subset=['UserID', 'TransactionDate'], inplace=True)

# Check for duplicate records
print("\nDuplicate Records Before Removal:")
print(data.duplicated(subset=['UserID', 'Amount'], keep=False).sum())

# Create a new dataset for modeling, keeping duplicates
data_for_modeling = data.copy()

# Convert 'TransactionDate' to datetime
data_for_modeling['TransactionDate'] = pd.to_datetime(data_for_modeling['TransactionDate'])

# Check data types
print("\nData Types After Cleaning:")
print(data_for_modeling.dtypes)

# Reset index after cleaning
data_for_modeling.reset_index(drop=True, inplace=True)

# Create a duplicate label
data_for_modeling['IsDuplicate'] = data_for_modeling.duplicated(subset=['UserID', 'Amount'], keep=False).astype(int)

# Save the new dataset for modeling to a new CSV file
data_for_modeling.to_csv('data_for_modeling.csv', index=False)
print("\nNew dataset for modeling saved to 'data_for_modeling.csv'.")


Initial Data:
   TransactionID  UserID       Amount             TransactionDate  \
0            262      90  1230.955467  2025-06-01 01:50:17.991587   
1           1142      35  1111.905376  2025-05-29 02:45:17.991105   

  TransactionType    Location     Status  IsDuplicate  
0          salary  Location_B  completed            1  
1        purchase  Location_C  completed            1  

Missing Values:
TransactionID      0
UserID             0
Amount             0
TransactionDate    0
TransactionType    0
Location           0
Status             0
IsDuplicate        0
dtype: int64

Duplicate Records Before Removal:
524

Data Types After Cleaning:
TransactionID               int64
UserID                      int64
Amount                    float64
TransactionDate    datetime64[ns]
TransactionType            object
Location                   object
Status                     object
IsDuplicate                 int64
dtype: object

New dataset for modeling saved to 'data_for_modeling.csv'.



### Key points on updated data cleaning code:

- It loads the synthetic dataset and checks for missing values.
- It handles missing values by filling the `Amount` column with the mean and dropping rows with missing `UserID` or `TransactionDate`.
- It counts the rows that share a `UserID` and `Amount` with at least one other row (524 here) and keeps all of them in a copy named `data_for_modeling`.
- The `TransactionDate` is converted to a datetime format, and the data types are displayed after cleaning.
- Finally, `IsDuplicate` is recomputed on that copy, and the result is saved to `data_for_modeling.csv` for modeling.
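
The introduction to this part mentions normalizing or scaling numerical features, which the cell above does not do. The following is a minimal sketch of one common option, scikit-learn's `StandardScaler`, applied to an `Amount` column. The data here is generated inside the block purely for illustration, and the choice of scaler is an assumption rather than something the notebook prescribes. The scaler is fitted on the training split only, so no information from the test split leaks into the transformation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Illustrative stand-in for the 'Amount' column (same range as the notebook).
df = pd.DataFrame({"Amount": rng.uniform(100, 10000, size=200)})

train, test = train_test_split(df, test_size=0.3, random_state=42)

# Fit on the training split only, then reuse the same mean/std for the test split.
scaler = StandardScaler()
train_scaled = scaler.fit_transform(train[["Amount"]])
test_scaled = scaler.transform(test[["Amount"]])

print("Train mean/std after scaling:", train_scaled.mean(), train_scaled.std())
print("Test mean/std after scaling: ", test_scaled.mean(), test_scaled.std())
```

Scaling mostly helps the optimizer converge when features have very different ranges. `UserID` is an identifier rather than a quantity, so scaling it would not make it more meaningful as a numeric feature.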




## Part 3: Building a Logistic Regression Model for AML

**Steps:**

1. **Load the Updated Cleaned Dataset**: Import the dataset prepared for modeling.
2. **Prepare Features and Target**: Define the features (independent variables) and the target (dependent variable).
3. **Split the Dataset**: Divide the dataset into training and testing sets to evaluate model performance.
4. **Fit the Logistic Regression Model**: Train the model using the training set.
5. **Make Predictions**: Use the trained model to make predictions on the test set.
6. **Evaluate the Model**: Assess the model's performance using metrics such as precision, recall, and F1-score.




### Data Preprocessing and Checking the Distribution of Data

In [43]:
from sklearn.model_selection import train_test_split

# Load the cleaned dataset
data = pd.read_csv('data_for_modeling.csv')

# Prepare features and target
X = data[['UserID', 'Amount']]
y = data['IsDuplicate']

# Check the distribution of the target variable
print("Target Variable Distribution:")
print(y.value_counts())

# Ensure both classes are present. value_counts() never reports a zero count
# for an absent class, so check the number of distinct labels instead.
if y.nunique() < 2:
    print("Not enough samples of both classes to train the model.")
else:
    # Split the dataset (70% train, 30% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)



Target Variable Distribution:
IsDuplicate
0    776
1    524
Name: count, dtype: int64



### Key Points:
- The code loads the cleaned dataset and prepares the features (`X`) and target variable (`y`).
- It checks the distribution of the target variable to ensure that there are samples of both classes (duplicates and non-duplicates).
- If there are insufficient samples of either class, a warning message is printed.
- If the distribution is adequate, the dataset is split into training and testing sets using a 70-30 split.
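
With an imbalanced target, a plain random split can shift the class proportions slightly between the training and testing sets. One optional variation, shown here as a sketch rather than as what the notebook does, is to pass `stratify=y` to `train_test_split`. The arrays below are placeholders that reuse only the class counts printed above (776 and 524).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder target with the class counts reported above; X is a dummy feature.
y = np.array([0] * 776 + [1] * 524)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the share of each class (almost) identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print("Share of class 1 overall:", round(y.mean(), 4))
print("Share of class 1 in train:", round(y_train.mean(), 4))
print("Share of class 1 in test: ", round(y_test.mean(), 4))
```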


### Building the model

In [44]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix

# Fit Logistic Regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Calculate degrees of freedom
num_features = X.shape[1]
degrees_of_freedom = num_features + 1  # Including intercept
print(f'Degrees of Freedom: {degrees_of_freedom}')



Classification Report:
              precision    recall  f1-score   support

           0       0.60      1.00      0.75       234
           1       0.00      0.00      0.00       156

    accuracy                           0.60       390
   macro avg       0.30      0.50      0.38       390
weighted avg       0.36      0.60      0.45       390


Confusion Matrix:
[[234   0]
 [156   0]]
Degrees of Freedom: 3


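**Note on the results:** scikit-learn emits an `UndefinedMetricWarning` here because the model never predicts class 1, so precision for that class is undefined and is reported as 0.00. The confusion matrix confirms it: all 390 test transactions are predicted as class 0, so the 0.60 accuracy equals the share of non-duplicates in the test set (234 of 390), and the model detects none of the 156 duplicates.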
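
One likely reason for the zero recall on class 1 is that `IsDuplicate` depends on whether a `(UserID, Amount)` pair repeats, which is not a linear function of the raw `UserID` and `Amount` values. The sketch below uses a small stand-in dataset generated inside the block (not the notebook's file) to compare a majority-class baseline, logistic regression on the raw columns, and logistic regression on a hypothetical `PairCount` feature. `PairCount` is an assumption introduced for illustration. Because it is computed on the full data with the same rule that defines the label, the comparison only illustrates the point and is not a realistic detector.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in dataset: 1,000 rows plus 300 rows copied from them.
base = pd.DataFrame({
    "UserID": rng.integers(1, 100, size=1000),
    "Amount": rng.uniform(100, 10000, size=1000),
})
copies = base.sample(300, replace=True, random_state=0)
df = pd.concat([base, copies], ignore_index=True)
df["IsDuplicate"] = df.duplicated(subset=["UserID", "Amount"], keep=False).astype(int)

# Hypothetical feature: number of rows sharing the same (UserID, Amount) pair.
df["PairCount"] = df.groupby(["UserID", "Amount"])["Amount"].transform("size")

train, test = train_test_split(df, test_size=0.3, random_state=42)

def recall_for(features):
    """Fit logistic regression on the given columns; return recall for class 1."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train[features], train["IsDuplicate"])
    return recall_score(test["IsDuplicate"], clf.predict(test[features]))

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(train[["UserID"]], train["IsDuplicate"])
baseline_recall = recall_score(test["IsDuplicate"], baseline.predict(test[["UserID"]]))

raw_recall = recall_for(["UserID", "Amount"])
pair_recall = recall_for(["PairCount"])

print(f"Recall (class 1), majority baseline: {baseline_recall:.2f}")
print(f"Recall (class 1), raw UserID/Amount: {raw_recall:.2f}")
print(f"Recall (class 1), PairCount feature: {pair_recall:.2f}")
```

In practice, a duplicate detector would need features that capture repetition (for example, counts or time gaps computed only from earlier transactions) rather than the raw identifier and amount.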



### Interpretation of Degrees of Freedom

The model uses 2 features, `UserID` and `Amount`, so it estimates 3 parameters: 2 coefficients plus 1 intercept. This parameter count is what the code reports as the degrees of freedom. It matters for understanding model complexity and for conducting statistical tests:

- **Model Complexity**: Degrees of freedom indicate the model's complexity. A higher number means more parameters, allowing for a better fit to the training data but increasing the risk of overfitting.

- **Statistical Testing**: Degrees of freedom determine which reference distribution a test statistic is compared against. In a likelihood ratio test comparing nested models, the statistic is compared against a Chi-squared distribution whose degrees of freedom equal the difference in the number of parameters between the two models.

- **Model Evaluation**: AIC and BIC combine the model's log-likelihood with a penalty on the number of parameters: AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of observations, and L the maximized likelihood. Models with fewer parameters are generally preferred if they provide a similar fit, as they are less likely to overfit.
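
As a rough illustration of how the parameter count enters these quantities, the sketch below computes AIC, BIC, and a likelihood ratio test against an intercept-only model. It uses synthetic data generated inside the block, not the notebook's dataset, and sets a very large `C` so that scikit-learn's default L2 penalty is negligible. That is an assumption made so the fit approximates an unpenalized maximum-likelihood estimate; with the default `C=1.0`, as in the model above, these formulas hold only approximately.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

rng = np.random.default_rng(0)

# Illustrative data: two standardized features and a binary target.
n = 500
X = rng.normal(size=(n, 2))
logits = 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# Very weak regularization, so the fit is close to maximum likelihood.
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)

k = X.shape[1] + 1  # 2 coefficients + 1 intercept = 3 parameters
log_lik = -log_loss(y, model.predict_proba(X), normalize=False)

aic = 2 * k - 2 * log_lik
bic = k * np.log(n) - 2 * log_lik

# Likelihood ratio test against the intercept-only (null) model,
# whose fitted probability is simply the observed share of class 1.
p_hat = y.mean()
log_lik_null = np.sum(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))
lr_stat = 2 * (log_lik - log_lik_null)
lr_df = k - 1  # difference in parameter counts: 3 - 1 = 2
p_value = chi2.sf(lr_stat, lr_df)

print(f"Parameters (k): {k}")
print(f"AIC: {aic:.2f}, BIC: {bic:.2f}")
print(f"LR statistic: {lr_stat:.2f} on {lr_df} degrees of freedom, p-value: {p_value:.3g}")
```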

