In [2]:
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.preprocessing import TransactionEncoder
from sklearn.ensemble import IsolationForest
import pandas as pd
import re

# Sample customer complaints data
data = {'complaint_description': ["I received an unsolicited call about a new credit card offer.",
                                  "The automated call was about marketing, and I did not opt in.",
                                  "I got repeated robocalls from a telemarketer.",
                                  "They called me despite being on the do-not-call list."]}
df = pd.DataFrame(data)

# List of TCPA-related keywords
tcp_keywords = [
    "robocall", "telemarketing", "automated call", "unsolicited call", "do-not-call", "spam", 
    "call blocking", "customer consent", "marketing call", "opt-out", "telemarketing list", "call center", 
    "caller ID", "illegal calls", "call volume", "call recording", "repeated calls", "call tracking", 
    "telephone solicitation", "unwanted message", "ringless voicemail", "artificial voice", "pre-recorded message", 
    "contact center", "privacy violation", "unwanted solicitation", "repeated messages", "text messages", "text spam", 
    "telemarketing campaign", "opt-in", "robocall violation", "tcpa fine", "telemarketing regulations", "consumer rights", 
    "marketing database", "cell phone spam", "call harassment", "do-not-call list violation", "tcpa compliance", 
    "call harassment", "call recording laws", "call abandonment", "message spam", "legal call", "outbound call", 
    "automated system", "no-call list", "consumer protection", "pre-recorded voicemail", "ringless voicemail campaign", 
    "spam calls", "voice message spam", "violent call behavior", "do-not-call registry", "prerecorded marketing", 
    "inbound marketing", "call interruption", "commercial call", "cold call", "legal obligations", "communication consent", 
    "call frequency", "consumer complaint", "call tracking tool", "caller harassment", "junk text", "unsolicited fax", 
    "caller script", "telemarketing harassment", "predictive dialing", "call suppression", "marketing compliance", 
    "unsolicited marketing", "call monitoring", "contact violation", "calling system", "number portability", 
    "spam filtering", "do-not-call registry list", "privacy breach", "phone bill scams", "voice message interception", 
    "consumer notification", "repetitive dialing", "contact management", "spam report", "marketing automation", 
    "data protection", "telemarketing sales", "scam calls", "consumer fraud", "do-not-disturb violation", "marketing policy", 
    "call security", "privacy law", "tcpa complaint", "regulatory compliance", "marketing scam", "call automation", 
    "harassing calls", "privacy rights violation", "telemarketing text", "recorded message", "outbound telemarketer", 
    "automated outbound call", "call center monitoring", "call scheduling", "sales calls", "voice spam", 
    "do-not-call protection", "telemarketing fraud", "tcpa penalty", "call harassment regulation", "marketing laws", 
    "phone marketing", "predictive dialing system", "data security", "consumer privacy", "autodialing system", 
    "call pattern analysis", "spam filtering system", "no-call list violation", "no-consent marketing", "robocall blocker", 
    "voice call regulation", "automated dialing system", "internet-based calling", "sms spam", "phone scam", 
    "blocking robocalls", "tcpa lawsuit", "call not answered", "messaging system", "calling patterns", 
    "automated message delivery", "consent violation", "caller information", "fraudulent calls", "digital marketing", 
    "call center regulation", "unsolicited advertisement", "contact information", "call overload", "consumer alerts", 
    "phone solicitation", "outbound marketing", "consumer complaint resolution", "marketing strategy", "call interruption", 
    "call rejection", "data privacy", "marketing guidelines", "call recording compliance", "tcpa fine structure", 
    "no-contact policy", "invalid call", "caller database", "digital solicitation", "marketing violations", "voice-based marketing", 
    "regulatory violation", "compliance violations", "call rights", "repeated messages", "illegal message", 
    "unsolicited marketing text", "automated marketing", "robocall legislation", "violating calls", "phone-based scams", 
    "blocking unsolicited calls", "marketing fraud", "automated telemarketing", "outbound dialing", "call duration", 
    "unsolicited communication", "telemarketing disclosure", "do-not-call regulation", "compliance tracking", "repeat solicitation", 
    "telecommunication violation", "persistent solicitation", "regulatory fines", "phone solicitation compliance", "spam messaging", 
    "call violations", "automated sales calls", "voicemail spam", "call misrepresentation", "robocall prevention", 
    "telemarketer harassment", "message misdirection", "non-consent solicitation", "message overload", "phone solicitation law", 
    "text communication harassment", "call center regulation compliance", "marketing excess", "automated marketing violation"
]

# Function to extract TCPA-related keywords from complaints
def find_tcp_keywords(complaint, keywords):
    complaint_lower = complaint.lower()
    matches = [keyword for keyword in keywords if re.search(r'\b' + re.escape(keyword) + r'\b', complaint_lower)]
    return matches

# Apply function to extract keywords from complaints
df['tcp_keywords_found'] = df['complaint_description'].apply(lambda x: find_tcp_keywords(x, tcp_keywords))

# Prepare the data for FPGrowth
transactions = df['tcp_keywords_found'].apply(lambda x: x if x else None).dropna().tolist()

# Transaction Encoding for FPGrowth
te = TransactionEncoder()
te_ary = te.fit(transactions).transform(transactions)
df_transactions = pd.DataFrame(te_ary, columns=te.columns_)

# Apply FPGrowth to find frequent itemsets
frequent_itemsets = fpgrowth(df_transactions, min_support=0.1, use_colnames=True)

# Extract itemsets for use as features
frequent_itemsets_list = frequent_itemsets['itemsets'].apply(lambda x: list(x)).tolist()

# Create a feature matrix based on frequent itemsets
def create_feature_matrix(df, frequent_itemsets_list, te_columns):
    # Add one binary column per frequent itemset:
    # 1 if the complaint contains every keyword in the itemset, else 0.
    # (te_columns is kept for compatibility with the call below but is not used.)
    for idx, itemset in enumerate(frequent_itemsets_list):
        feature_name = f"pattern_{idx}"
        df[feature_name] = df['tcp_keywords_found'].apply(lambda x: 1 if all(keyword in x for keyword in itemset) else 0)
    return df

# Create the feature matrix
df = create_feature_matrix(df, frequent_itemsets_list, te.columns_)

# Now, df contains the binary features for the frequent itemsets
df_features = df.drop(columns=['complaint_description', 'tcp_keywords_found'])

print(df_features)


   pattern_0  pattern_1  pattern_2
0          1          0          0
1          0          1          0
2          0          0          0
3          0          0          1
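
Row 2 in the output above is all zeros even though that complaint mentions robocalls: the pattern `\brobocall\b` requires a word boundary right after "robocall", so the plural "robocalls" does not match. The sketch below shows one possible way to tolerate a trailing "s". The helper name `find_keywords_plural` and the short keyword list are illustrative assumptions, not part of the notebook, and real complaint text may need proper stemming or lemmatization instead.

```python
import re

def find_keywords_plural(text, keywords):
    """Return the keywords found in text, allowing an optional trailing 's'."""
    text = text.lower()
    return [kw for kw in keywords
            if re.search(r'\b' + re.escape(kw) + r's?\b', text)]

# Small illustrative subset of the keyword list
keywords = ["robocall", "unsolicited call", "do-not-call"]
complaint = "I got repeated robocalls from a telemarketer."

strict = [kw for kw in keywords
          if re.search(r'\b' + re.escape(kw) + r'\b', complaint.lower())]
relaxed = find_keywords_plural(complaint, keywords)

print(strict)   # strict word-boundary match, as in the cell above
print(relaxed)  # match that tolerates the plural form
```

With the relaxed pattern the third complaint would contribute a transaction to FP-Growth instead of being dropped as empty.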


In [3]:
import pandas as pd
import numpy as np

# Number of rows in the dummy dataset
n = 1000

# Generating random data for the datasets
np.random.seed(42)

# Transaction Dataset
transaction_data = {
    'transaction_id': np.arange(1, n+1),
    'transaction_amount': np.random.uniform(50, 1000, size=n),
    'loan_cap_amt': np.random.uniform(100, 5000, size=n),
    'ca_fee_amt': np.random.uniform(1, 150, size=n),
    'net_interest_amt': np.random.uniform(0, 100, size=n),
    'total_credit_limit': np.random.uniform(500, 10000, size=n),
    'trans_type_id': np.random.choice([1, 60, 80], size=n),  # 1: Normal, 60: Cash Advance, 80: Other
    'total_net_fees_amt': np.random.uniform(0, 50, size=n),
}

# Customer Experience Dataset
customer_experience_data = {
    'delnq_stat_cur_mon': np.random.choice([0, 1, 2], size=n),  # 0: No Delinquency, 1: Delinquent, 2: Severe Delinquent
    'payment_history': np.random.choice([0, 1], size=n),  # 0: No Issues, 1: Issues with payments
    'account_age': np.random.randint(1, 20, size=n),  # Random account age from 1 to 19 years (randint's upper bound is exclusive)
}

# Billing Statement Dataset
billing_statement_data = {
    'fees': np.random.uniform(0, 200, size=n),
    'payment_amount': np.random.uniform(50, 1000, size=n),
    'due_balance': np.random.uniform(50, 5000, size=n),
    'total_balance_due': np.random.uniform(100, 10000, size=n),
}

# Regulatory Classification Dataset
regulatory_classification_data = {
    'regulatory_classfication_1': np.random.choice(['A', 'B', 'C'], size=n),
    'regulatory_classfication_2': np.random.choice(['X', 'Y', 'Z'], size=n),
}

# Combine all datasets into one DataFrame
df = pd.DataFrame({**transaction_data, **customer_experience_data, **billing_statement_data, **regulatory_classification_data})

# Feature Engineering

# Loan-to-Transaction Ratio
df['loan_to_transaction_ratio'] = df['loan_cap_amt'] / df['transaction_amount']

# High Fee Ratio
df['high_fee_ratio'] = df['ca_fee_amt'] / df['transaction_amount']

# Credit Utilization Ratio
df['credit_utilization_ratio'] = df['transaction_amount'] / df['total_credit_limit']

# Over-Credit Limit Fee Indicator
df['over_credit_limit_fee'] = (df['transaction_amount'] > df['total_credit_limit']).astype(int)

# APR Misuse Flag
df['apr_misuse_flag'] = (df['ca_fee_amt'] > (0.1 * df['transaction_amount'])).astype(int)

# Interest-to-Loan Cap Ratio
df['interest_to_loan_cap_ratio'] = df['net_interest_amt'] / df['loan_cap_amt']

# Fee-to-Credit Ratio
df['fee_to_credit_ratio'] = df['ca_fee_amt'] / df['total_credit_limit']

# Regulatory Classification Interaction
df['regulatory_class_interaction'] = df['regulatory_classfication_1'].astype(str) + '_' + df['regulatory_classfication_2'].astype(str)

# High-Interest Flag (always 0 on this dummy data, where net_interest_amt is drawn from [0, 100))
df['high_interest_frequency'] = (df['net_interest_amt'] > 100).astype(int)

# Cash Advance Flag
df['cash_advance_flag'] = (df['trans_type_id'] == 60).astype(int)

# Loan Cap Adjustment Frequency (simulated data)
df['loan_cap_adjustment_frequency'] = np.random.choice([0, 1], size=n)

# Cash Advance-to-Transaction Ratio
df['cash_advance_ratio'] = df['ca_fee_amt'] / (df['ca_fee_amt'] + df['transaction_amount'])

# Total Fees-to-Transaction Amount Ratio
df['fees_to_transaction_ratio'] = df['total_net_fees_amt'] / df['transaction_amount']

# Show the first few rows of the dataset
print(df.head())


   transaction_id  transaction_amount  loan_cap_amt  ca_fee_amt  \
0               1          405.813113   1007.151351   39.994147   
1               2          953.178591   2755.314642   37.799841   
2               3          745.394245   4377.434596  136.031932   
3               4          618.725560   3687.901943   38.182384   
4               5          198.217708   4052.149625   41.520509   

   net_interest_amt  total_credit_limit  trans_type_id  total_net_fees_amt  \
0         67.270299         5933.960844             80           31.417356   
1         79.668140         8151.607128             80            9.205066   
2         25.046790         7721.528833             80            5.282517   
3         62.487410         1962.049095             60           40.642549   
4         57.174598         1917.869963             80           28.937824   

   delnq_stat_cur_mon  payment_history  ...  over_credit_limit_fee  \
0                   2                1  ...               

In [5]:
df_fpgrowth.head()

Unnamed: 0,trans_type_id,ca_fee_amt,loan_cap_amt,net_interest_amt,transaction_amount
0,80,39.994147,1007.151351,67.270299,405.813113
1,80,37.799841,2755.314642,79.66814,953.178591
2,80,136.031932,4377.434596,25.04679,745.394245
3,60,38.182384,3687.901943,62.48741,618.72556
4,80,41.520509,4052.149625,57.174598,198.217708


In [10]:
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Example of preparing data for FP-Growth
# Let's assume you have a dataframe `df` with numerical and categorical columns.
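# Convert numerical columns to binary indicators (1 if above the threshold, else 0)

# Example thresholding for numeric columns.
# Note: the dummy transaction_amount above is drawn from uniform(50, 1000), so the
# "> 1000" indicator is always 0 on that data; choose thresholds that suit your data.
df['transaction_amount_binary'] = (df['transaction_amount'] > 1000).astype(int)  # e.g., 1 if transaction amount > 1000, else 0
df['ca_fee_amt_binary'] = (df['ca_fee_amt'] > 50).astype(int)  # e.g., 1 if fee > 50, else 0

# Select the binary indicator columns as input for FP-Growth
# (categorical columns would first need one-hot encoding, e.g. with pd.get_dummies)
df_fpgrowth = df[['transaction_amount_binary', 'ca_fee_amt_binary']]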

# Apply FP-Growth algorithm with min_support
frequent_itemsets = fpgrowth(df_fpgrowth, min_support=0.05, use_colnames=True)

# Generate association rules from frequent itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

# Extract itemsets and create new features for each frequent pattern
new_features = pd.DataFrame()

# For each frequent itemset, create binary features (1 or 0) indicating the presence of this itemset
for index, row in frequent_itemsets.iterrows():
    # Create feature name by combining itemset elements
    feature_name = "_and_".join([col for col in row['itemsets']])

    # Add binary feature column for this itemset
    new_features[feature_name] = df_fpgrowth.apply(lambda x: all(x[col] == 1 for col in row['itemsets']), axis=1).astype(int)

# For each rule in the association rules, create a feature representing the lift or confidence of the rule
for index, row in rules.iterrows():
    # Create feature name based on the antecedent (left-hand side) and consequent (right-hand side)
    antecedent = "_and_".join([col for col in row['antecedents']])
    consequent = "_and_".join([col for col in row['consequents']])

    # Feature names based on lift and confidence
    lift_feature = f"lift_{antecedent}_to_{consequent}"
    confidence_feature = f"confidence_{antecedent}_to_{consequent}"

    # Create the feature based on lift and confidence values
    new_features[lift_feature] = row['lift']
    new_features[confidence_feature] = row['confidence']

# Combine the new features with the original dataframe
df_features = pd.concat([df, new_features], axis=1)

# Display the newly created features
print(df_features.head())


   transaction_id  transaction_amount  loan_cap_amt  ca_fee_amt  \
0               1          405.813113   1007.151351   39.994147   
1               2          953.178591   2755.314642   37.799841   
2               3          745.394245   4377.434596  136.031932   
3               4          618.725560   3687.901943   38.182384   
4               5          198.217708   4052.149625   41.520509   

   net_interest_amt  total_credit_limit  trans_type_id  total_net_fees_amt  \
0         67.270299         5933.960844             80           31.417356   
1         79.668140         8151.607128             80            9.205066   
2         25.046790         7721.528833             80            5.282517   
3         62.487410         1962.049095             60           40.642549   
4         57.174598         1917.869963             80           28.937824   

   delnq_stat_cur_mon  payment_history  ...  fee_to_credit_ratio  \
0                   2                1  ...             0.00



In [None]:
1. Transaction Amount vs. Loan Amount Ratio:
df['trans_loan_ratio'] = df['transaction_amount'] / df['loan_cap_amt']
Why?: This ratio gives an indication of the scale of the transaction in relation to the loan amount. If the transaction amount is unusually high compared to the loan limit, it could suggest that the APR or loan conditions are being manipulated or misrepresented, violating RegZ.
How it helps: Detects cases where transactions are disproportionate to the loan amount, potentially indicating improper APR disclosures or terms.
2. Fee to Transaction Amount Ratio:
df['fee_trans_ratio'] = df['ca_fee_amt'] / df['transaction_amount']
Why?: RegZ requires accurate fee disclosures for loans and credit. This ratio shows the proportion of fees relative to the transaction amount. If this ratio is too high, it could suggest hidden or undisclosed fees that violate RegZ requirements for transparency.
How it helps: Identifies instances where fee amounts might not be properly disclosed, indicating potential RegZ compliance issues.
3. Transaction Type Frequency:
df['transaction_type_frequency'] = df.groupby('trans_type_id')['trans_type_id'].transform('count')
Why?: Certain transaction types might trigger higher APRs or fee structures. For example, loans involving cash advances might carry higher APRs than regular purchases.
How it helps: By analyzing the frequency of transaction types, you can detect patterns where certain transaction types are more prevalent than expected, indicating potential RegZ violations related to non-disclosure of APR rates for specific types of transactions.
4. Loan Cap Amount to Fee Ratio:
df['loan_fee_ratio'] = df['loan_cap_amt'] / df['ca_fee_amt']
Why?: This ratio can indicate if the fees are reasonable in relation to the loan cap. Because the fee is in the denominator, a low value means the fee is high relative to the loan cap. RegZ requires that the loan fees and APR be disclosed accurately, so fees that are disproportionately high compared to the loan amount could point to a RegZ problem.
How it helps: Helps detect overly high fees compared to loan amounts, which could indicate hidden APR costs, or misrepresentation of APR in loan agreements.
5. Interest Amount Change Over Time (Proxy for APR Trend):
df['apr_change'] = df['net_interest_amt'].diff().fillna(0)
Why?: RegZ requires that interest rates, fees, and APR disclosures be consistent and transparent. The dataset has no APR column, so this feature uses the row-to-row change in net_interest_amt as a proxy. It is only meaningful when the rows are sorted chronologically within a single account.
How it helps: Detects sudden or unexplained changes in interest charges, which might violate RegZ if the underlying rate change was not properly disclosed to consumers.
6. Loan-to-Value (LTV) Ratio:
df['ltv_ratio'] = df['loan_cap_amt'] / df['transaction_amount']
Why?: The LTV ratio is an important factor in determining the risk of a loan. Here the transaction amount stands in for the value, so this is a proxy rather than a true LTV. It is also the reciprocal of trans_loan_ratio from item 1, so the two features carry the same information. In RegZ, any changes in loan terms (such as APR) based on the LTV ratio need to be clearly disclosed. High LTV ratios might indicate riskier loans, which could warrant a higher APR.
How it helps: Identifies situations where high LTV ratios might lead to higher APRs that are not properly disclosed, violating RegZ rules.
7. Interest Amount to Loan Cap Ratio:
df['interest_loan_ratio'] = df['net_interest_amt'] / df['loan_cap_amt']
Why?: This ratio helps identify if the interest charged on a loan is disproportionate to the loan cap. Under RegZ, the interest rate and charges must be clear, and if this ratio is high, it could indicate an APR disclosure issue.
How it helps: Detects excessively high interest relative to the loan cap, which could indicate undisclosed APR violations.
8. Transaction Amount vs. Net Interest Amount Ratio:
df['trans_interest_ratio'] = df['transaction_amount'] / df['net_interest_amt']
Why?: This ratio relates the size of a transaction to the interest charged on it. If the transaction amount is significantly higher than the net interest amount, it may indicate that the APR is being disclosed in a misleading way. Rows where net_interest_amt is 0 produce infinite values and need separate handling.
How it helps: Highlights situations where APR disclosures might be manipulated or misrepresented, potentially violating RegZ.
9. Fee Amount Relative to Loan Term:
df['fee_term_ratio'] = df['ca_fee_amt'] / df['loan_term_months']
Why?: RegZ also governs the total cost of credit, including fees over the term of the loan. This ratio helps identify if fees are excessively high compared to the length of the loan. It assumes a loan_term_months column, which the dummy dataset generated earlier does not include.
How it helps: Identifies situations where fees are disproportionate to the loan term, which could indicate hidden APR costs that violate RegZ.
10. Change in Loan Cap Amount Over Time:
df['loan_cap_change'] = df['loan_cap_amt'].diff().fillna(0)
Why?: RegZ requires that the terms of the loan, including the loan cap, be disclosed clearly. A sudden or unannounced change in the loan cap could be a violation of RegZ. As with item 5, diff() compares consecutive rows, so the data must be sorted chronologically within each account for the result to represent a change over time.
How it helps: Helps detect any unexpected changes in loan caps that might not be disclosed properly to the consumer, indicating a possible RegZ violation.
11. Fee Frequency per Transaction Type:
df['fee_trans_type_frequency'] = df.groupby(['trans_type_id', 'ca_fee_amt'])['transaction_amount'].transform('count')
Why?: Some transaction types may be more likely to incur higher fees. RegZ requires that these fees and APRs be clearly disclosed. A pattern of fees being charged for specific transaction types could be a red flag. Because ca_fee_amt is continuous, consider bucketing fees first (e.g. with pd.cut); otherwise most groups contain a single row.
How it helps: Identifies instances where certain transaction types are associated with disproportionately high fees, which could be in violation of RegZ's requirement for transparent APR disclosures.
How These Features Help in the APR and RegZ Use Case:
RegZ Transparency Requirements: These features allow you to detect any signs of non-compliance with Regulation Z's requirements for clear and accurate APR disclosures, fee transparency, and loan term visibility.
Identifying Irregular Patterns: By creating ratios and examining changes in loan amounts, fees, APR, and transaction amounts, you can identify unusual or suspicious patterns that may indicate hidden or improperly disclosed APR terms, fees, or loan conditions.
Compliance Monitoring: These features help you monitor whether fees, interest amounts, and loan amounts stay in line with what was disclosed under RegZ. They flag accounts for review rather than establish a violation, so that customers exposed to excessive or undisclosed costs can be identified.
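
Two practical issues recur in the features above: ratios whose denominator can be zero, and diff() features that only make sense within one account in time order. The sketch below illustrates one way to handle both. The columns `account_id` and `statement_date`, the helper `safe_ratio`, and the four sample rows are assumptions made for illustration; the dummy dataset in this notebook has no account or date column.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio that yields NaN instead of inf when the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)

# Hypothetical sample data: account_id and statement_date are illustrative columns
df_demo = pd.DataFrame({
    'account_id': ['A', 'A', 'B', 'B'],
    'statement_date': pd.to_datetime(['2024-02-01', '2024-01-01',
                                      '2024-01-01', '2024-02-01']),
    'transaction_amount': [100.0, 250.0, 400.0, 50.0],
    'net_interest_amt': [10.0, 0.0, 20.0, 25.0],
})

# Ratio with a zero-safe denominator
df_demo['trans_interest_ratio'] = safe_ratio(df_demo['transaction_amount'],
                                             df_demo['net_interest_amt'])

# Sort chronologically within each account, then diff per account so that
# a change is never computed across two different accounts
df_demo = df_demo.sort_values(['account_id', 'statement_date'])
df_demo['interest_change'] = (
    df_demo.groupby('account_id')['net_interest_amt'].diff().fillna(0)
)

print(df_demo)
```

Whether NaN is the right placeholder for an undefined ratio depends on the downstream model; some estimators need the gaps imputed or flagged with a separate indicator column.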

In [11]:
from mlxtend.frequent_patterns import fpgrowth, association_rules
import pandas as pd

# Sample dataframe with raw transaction and loan data (binarized below)
df = pd.DataFrame({
    'transaction_amount': [100, 2000, 3000, 15000, 10000],
    'ca_fee_amt': [10, 60, 120, 80, 90],
    'loan_cap_amt': [5000, 10000, 15000, 25000, 20000],
    'net_interest_amt': [50, 200, 250, 300, 500],
    'loan_term_months': [12, 24, 36, 48, 60],
    'trans_type_id': ['credit_purchase', 'cash_advance', 'credit_purchase', 'cash_advance', 'credit_purchase']
})

# Preprocess data to create binary columns
df['transaction_amount_binary'] = (df['transaction_amount'] > 1000).astype(int)
df['ca_fee_amt_binary'] = (df['ca_fee_amt'] > 50).astype(int)
df['loan_cap_amt_binary'] = (df['loan_cap_amt'] > 10000).astype(int)
df['net_interest_amt_binary'] = (df['net_interest_amt'] > 200).astype(int)

# Use FP-Growth to detect frequent itemsets
df_fpgrowth = df[['transaction_amount_binary', 'ca_fee_amt_binary', 'loan_cap_amt_binary', 'net_interest_amt_binary']]
frequent_itemsets = fpgrowth(df_fpgrowth, min_support=0.1, use_colnames=True)

# Extract and display the frequent itemsets
print(frequent_itemsets)

# Generate association rules from the frequent itemsets
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)
print(rules)


    support                                           itemsets
0       0.8                                (ca_fee_amt_binary)
1       0.8                        (transaction_amount_binary)
2       0.6                          (net_interest_amt_binary)
3       0.6                              (loan_cap_amt_binary)
4       0.8     (transaction_amount_binary, ca_fee_amt_binary)
5       0.6       (net_interest_amt_binary, ca_fee_amt_binary)
6       0.6           (loan_cap_amt_binary, ca_fee_amt_binary)
7       0.6  (transaction_amount_binary, net_interest_amt_b...
8       0.6   (transaction_amount_binary, loan_cap_amt_binary)
9       0.6     (loan_cap_amt_binary, net_interest_amt_binary)
10      0.6  (transaction_amount_binary, net_interest_amt_b...
11      0.6  (transaction_amount_binary, loan_cap_amt_binar...
12      0.6  (loan_cap_amt_binary, net_interest_amt_binary,...
13      0.6  (transaction_amount_binary, loan_cap_amt_binar...
14      0.6  (transaction_amount_binary, loan_cap_amt_b



In [None]:
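Here are some features modeled on FP-Growth itemsets for the APR and RegZ use case. Each one is a hand-written conjunction of threshold conditions, mirroring the kind of itemset FP-Growth would surface; the thresholds are illustrative.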

1. Transaction Amount + Fee Patterns
df['trans_fee_association'] = (df['transaction_amount'] > 1000) & (df['ca_fee_amt'] > 50)
Why?: Frequent itemsets between transaction amount and fees above certain thresholds could indicate unusual fee structures or hidden charges that violate RegZ rules. If larger transaction amounts are associated with disproportionately high fees, this could be indicative of non-disclosure.
How it helps: Identifies potentially misleading fee structures for large transactions that could violate RegZ if not properly disclosed.
2. Transaction Amount + Loan Cap + Interest Amount Patterns
df['trans_loan_interest_association'] = (df['transaction_amount'] > 1000) & (df['loan_cap_amt'] > 5000) & (df['net_interest_amt'] > 200)
Why?: The combination of high transaction amounts, large loan caps, and high interest amounts may indicate loans with high APRs. RegZ requires that these rates be transparently disclosed.
How it helps: Identifies high-risk transactions where loan terms may be non-compliant or APR is not clearly disclosed.
3. Transaction Type + Fee Amount Association
df['trans_type_fee_association'] = (df['trans_type_id'] == 'cash_advance') & (df['ca_fee_amt'] > 50)
Why?: Certain transaction types, such as cash advances, are often subject to higher fees. RegZ mandates that these fee structures are disclosed clearly. This feature detects whether specific transaction types have unusually high fees.
How it helps: Highlights cases where specific transaction types might carry hidden fees, potentially violating RegZ.
4. Loan Cap Amount + Transaction Amount + Interest Rate Association
df['loan_trans_interest_association'] = (df['loan_cap_amt'] > 10000) & (df['transaction_amount'] > 5000) & (df['net_interest_amt'] > 300)
Why?: High-value transactions often correlate with higher loan caps and interest rates. These features may represent loans with higher APRs, which need to be disclosed appropriately under RegZ.
How it helps: Flags transactions with large loan caps and high interest amounts, potentially indicating APR violations if the terms are not properly disclosed.
5. Transaction Amount + Loan Term + Fee Amount Patterns
df['trans_term_fee_association'] = (df['transaction_amount'] > 5000) & (df['loan_term_months'] > 24) & (df['ca_fee_amt'] > 100)
Why?: Loans with longer terms and high fees relative to transaction amounts could suggest improperly disclosed APR or excessive fees, as RegZ requires clear and transparent fee disclosures for longer-term loans.
How it helps: Detects high-fee, long-term loans that might not be compliant with RegZ transparency requirements.
6. Transaction Type + Loan Cap Amount + Interest Amount
df['trans_type_loan_interest_association'] = (df['trans_type_id'] == 'credit_purchase') & (df['loan_cap_amt'] > 5000) & (df['net_interest_amt'] > 200)
Why?: This association detects patterns where specific transaction types, such as credit purchases, are paired with higher loan caps and interest amounts. If these combinations appear frequently in the data, it might indicate improper APR disclosure under RegZ.
How it helps: Flags transactions where interest and loan terms might be manipulated or not disclosed according to RegZ.
7. Frequent Itemsets Based on Loan Cap and Transaction Amount
df['loan_trans_association'] = (df['loan_cap_amt'] > 10000) & (df['transaction_amount'] > 2000)
Why?: Combinations of high loan caps and large transaction amounts may signify loans with higher APRs. If such combinations occur frequently, it might be worth examining whether these terms are being disclosed properly.
How it helps: Highlights loans with large transactions and high caps, which are more likely to involve significant APR terms.
8. Fee Amount + Loan Term Association
df['fee_term_association'] = (df['ca_fee_amt'] > 50) & (df['loan_term_months'] > 24)
Why?: RegZ mandates that fees be properly disclosed over the life of the loan. If long-term loans consistently have higher fees, it might point to hidden or improperly disclosed APRs.
How it helps: Detects patterns where long-term loans are associated with higher fees, which could indicate RegZ violations in APR disclosures.
9. Transaction Amount + Loan Cap + Fee Amount Interaction
df['trans_loan_fee_interaction'] = (df['transaction_amount'] > 5000) & (df['loan_cap_amt'] > 10000) & (df['ca_fee_amt'] > 100)
Why?: The interaction between high transaction amounts, loan caps, and fees could suggest a problematic structure that may lead to violations of RegZ if APR disclosures are unclear or deceptive.
How it helps: Identifies potentially hidden or undisclosed fees and APR-related issues by analyzing the interactions between high transaction amounts, loan caps, and fees.
Steps to Generate These Features:
Preprocess Data: Ensure that all numerical features (e.g., transaction_amount, loan_cap_amt, net_interest_amt) are binarized or thresholded as needed to create discrete variables. For categorical features (e.g., trans_type_id), perform one-hot encoding.

Apply FP-Growth: Use the fpgrowth function from the mlxtend library on your preprocessed dataset to find frequent itemsets based on selected thresholds (e.g., transaction amounts above a certain threshold).

Generate Association Rules: Apply the association_rules function to extract rules with metrics like lift, support, and confidence. A lift above 1 means the antecedent and consequent occur together more often than they would if they were independent; rules like these can then be transformed into new features.

Create Binary Features: Based on the frequent itemsets, create binary features for each identified itemset. This is done by checking if a transaction contains the items in the itemset and setting the feature value to 1 if it does, and 0 otherwise.
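
To make the rule metrics in the steps above concrete, the following sketch computes support, confidence, and lift by hand for a single rule, using plain pandas rather than mlxtend. It reuses the five-row sample data and the thresholds from the sample cells in this notebook; the variable names are illustrative.

```python
import pandas as pd

# Same toy values and thresholds as the sample cells in this notebook
df_toy = pd.DataFrame({
    'transaction_amount': [100, 2000, 3000, 15000, 10000],
    'loan_cap_amt': [5000, 10000, 15000, 25000, 20000],
})

a = df_toy['transaction_amount'] > 1000   # antecedent item A
b = df_toy['loan_cap_amt'] > 10000        # consequent item B

support_a = a.mean()              # share of rows containing A
support_b = b.mean()              # share of rows containing B
support_ab = (a & b).mean()       # share of rows containing both A and B

confidence = support_ab / support_a   # P(B | A)
lift = confidence / support_b         # > 1: A and B co-occur more than if independent

print(f"support={support_ab:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.60 confidence=0.75 lift=1.25
```

The joint support of 0.6 agrees with the (transaction_amount_binary, loan_cap_amt_binary) row in the frequent itemset output shown earlier.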

In [13]:
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# Sample dataframe with raw transaction and loan data (binarized below)
df = pd.DataFrame({
    'transaction_amount': [100, 2000, 3000, 15000, 10000],
    'ca_fee_amt': [10, 60, 120, 80, 90],
    'loan_cap_amt': [5000, 10000, 15000, 25000, 20000],
    'net_interest_amt': [50, 200, 250, 300, 500],
    'loan_term_months': [12, 24, 36, 48, 60],
    'trans_type_id': ['credit_purchase', 'cash_advance', 'credit_purchase', 'cash_advance', 'credit_purchase']
})

# Preprocess data to create binary columns
df['transaction_amount_binary'] = (df['transaction_amount'] > 1000).astype(int)
df['ca_fee_amt_binary'] = (df['ca_fee_amt'] > 50).astype(int)
df['loan_cap_amt_binary'] = (df['loan_cap_amt'] > 10000).astype(int)
df['net_interest_amt_binary'] = (df['net_interest_amt'] > 200).astype(int)
df['loan_term_binary'] = (df['loan_term_months'] > 24).astype(int)
df['trans_type_credit_purchase'] = (df['trans_type_id'] == 'credit_purchase').astype(int)
df['trans_type_cash_advance'] = (df['trans_type_id'] == 'cash_advance').astype(int)

# Prepare data for FP-Growth
df_fpgrowth = df[['transaction_amount_binary', 'ca_fee_amt_binary', 'loan_cap_amt_binary', 'net_interest_amt_binary', 
                  'loan_term_binary', 'trans_type_credit_purchase', 'trans_type_cash_advance']]

# Apply FP-Growth to find frequent itemsets
frequent_itemsets = fpgrowth(df_fpgrowth, min_support=0.1, use_colnames=True)

# Generate association rules based on lift and confidence
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.0)

# Create new features based on the association rules.
# Columns are collected in a dict and assembled once, which avoids the DataFrame
# fragmentation PerformanceWarning caused by inserting columns one at a time.
feature_columns = {}

# For each rule, create a feature
for index, row in rules.iterrows():
    # Sort the frozensets so that feature names are deterministic
    antecedents = "_and_".join(sorted(row['antecedents']))
    consequents = "_and_".join(sorted(row['consequents']))

    # Feature name based on antecedent -> consequent, with rounded metrics
    feature_name = f"{antecedents}_to_{consequents}_lift_{row['lift']:.2f}_confidence_{row['confidence']:.2f}"

    # 1 if the row contains every item of the rule (antecedents and consequents), else 0.
    # Note: A -> B and B -> A cover the same items, so they produce identical columns.
    items = list(row['antecedents'] | row['consequents'])
    feature_columns[feature_name] = (df_fpgrowth[items] == 1).all(axis=1).astype(int)

new_features = pd.DataFrame(feature_columns, index=df.index)

# Combine the new features with the original dataframe
df_features = pd.concat([df, new_features], axis=1)

# Display the newly created features
print(df_features.head())




   transaction_amount  ca_fee_amt  loan_cap_amt  net_interest_amt  \
0                 100          10          5000                50   
1                2000          60         10000               200   
2                3000         120         15000               250   
3               15000          80         25000               300   
4               10000          90         20000               500   

   loan_term_months    trans_type_id  transaction_amount_binary  \
0                12  credit_purchase                          0   
1                24     cash_advance                          1   
2                36  credit_purchase                          1   
3                48     cash_advance                          1   
4                60  credit_purchase                          1   

   ca_fee_amt_binary  loan_cap_amt_binary  net_interest_amt_binary  ...  \
0                  0                    0                        0  ...   
1                  1            



In [None]:
For RegZ compliance and APR (Annual Percentage Rate) calculations, we need to focus on identifying thresholds and statistical features that could help assess if transactions are adhering to the regulations set out by Regulation Z. These thresholds can be based on statistical analyses that highlight unusual or non-compliant behaviors in the data.

Here are some threshold-based features that can be derived using statistical methods:

1. Transaction Amount Relative to Loan Cap
Feature Name: transaction_to_loan_cap_ratio

Why it helps: According to RegZ, the APR is influenced by loan amounts and their fees. If the transaction amount exceeds a certain percentage of the loan cap, it could indicate that the consumer might be exposed to unreasonably high interest rates, affecting the APR calculation.
Statistical Analysis: We can calculate the mean and standard deviation of this ratio to understand what constitutes a "normal" relationship.
Formula:

$$\text{transaction\_to\_loan\_cap\_ratio} = \frac{\text{transaction\_amount}}{\text{loan\_cap\_amt}}$$

Threshold:

Calculate the mean and standard deviation of this ratio.
Flag transactions where the ratio exceeds mean + 2×std or falls below mean - 2×std, indicating an unusually high or low transaction amount relative to the loan cap.
python
# Compute the per-row ratio first, then take the mean and std of that ratio
df['transaction_to_loan_cap_ratio'] = df['transaction_amount'] / df['loan_cap_amt']
mean_ratio = df['transaction_to_loan_cap_ratio'].mean()
std_ratio = df['transaction_to_loan_cap_ratio'].std()
df['outlier_transaction_to_loan'] = (df['transaction_to_loan_cap_ratio'] > (mean_ratio + 2 * std_ratio)) | (df['transaction_to_loan_cap_ratio'] < (mean_ratio - 2 * std_ratio))
Statistical Justification: Values outside mean ± 2×std are treated as outliers worth reviewing. Under a roughly normal distribution about 5% of values fall outside this band, so a flag is a prompt to check for APR miscalculation or RegZ non-compliance rather than proof of it.
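
The mean ± k×std rule described here recurs for every ratio in this list, so it can be factored into one helper. The following is a minimal sketch, assuming the five-row sample values shown in the notebook output above; `flag_mean_std_outliers` is a hypothetical name, not a pandas function.

```python
import pandas as pd

def flag_mean_std_outliers(values: pd.Series, k: float = 2.0) -> pd.Series:
    """Return a boolean Series that is True where a value lies outside mean ± k * std."""
    mean = values.mean()
    std = values.std()  # pandas defaults to the sample standard deviation (ddof=1)
    return (values > mean + k * std) | (values < mean - k * std)

# Sample values copied from the notebook output above
sample = pd.DataFrame({
    'transaction_amount': [100, 2000, 3000, 15000, 10000],
    'loan_cap_amt': [5000, 10000, 15000, 25000, 20000],
})
ratio = sample['transaction_amount'] / sample['loan_cap_amt']
flags = flag_mean_std_outliers(ratio, k=2.0)

print(ratio.tolist())
print(flags.tolist())
```

On this sample the ratios are 0.02, 0.2, 0.2, 0.6 and 0.5 with a mean of 0.304, and none of them falls outside the 2×std band.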

2. Loan Fee to Loan Amount Ratio
Feature Name: loan_fee_to_loan_ratio

Why it helps: RegZ requires clear disclosures of fees associated with loans. A fee that is excessively high compared to the loan amount could be a red flag for non-compliance or deceptive APR reporting.
Statistical Analysis: Calculate the mean and standard deviation for the fee-to-loan ratio to identify transactions where the fees are disproportionately high or low.
Formula:

$$\text{loan\_fee\_to\_loan\_ratio} = \frac{\text{ca\_fee\_amt}}{\text{loan\_cap\_amt}}$$

Threshold:

Calculate the mean and standard deviation of the fee-to-loan ratio.
Transactions that exceed mean + 3×std or fall below mean - 3×std could indicate potential compliance issues with fees as a percentage of the loan.
python
# Compute the per-row ratio first; its mean and std are not the ratio of the column means and stds
df['loan_fee_to_loan_ratio'] = df['ca_fee_amt'] / df['loan_cap_amt']
mean_fee_ratio = df['loan_fee_to_loan_ratio'].mean()
std_fee_ratio = df['loan_fee_to_loan_ratio'].std()
df['outlier_loan_fee'] = (df['loan_fee_to_loan_ratio'] > (mean_fee_ratio + 3 * std_fee_ratio)) | (df['loan_fee_to_loan_ratio'] < (mean_fee_ratio - 3 * std_fee_ratio))
Statistical Justification: Identifying fee-to-loan ratios that fall outside the range of mean ± 3×std helps pinpoint cases where fees could be disproportionately high or low, which may not align with RegZ rules.
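
The threshold above is defined on the per-row fee-to-loan ratio. Its mean and standard deviation are generally not the same as dividing the column means or column standard deviations, which is an easy slip to make. A small sketch using the sample values from the notebook output (the variable names are illustrative only):

```python
import pandas as pd

sample = pd.DataFrame({
    'ca_fee_amt': [10, 60, 120, 80, 90],
    'loan_cap_amt': [5000, 10000, 15000, 25000, 20000],
})

# Per-row ratio: the quantity the threshold is applied to
row_ratio = sample['ca_fee_amt'] / sample['loan_cap_amt']

mean_of_ratios = row_ratio.mean()
ratio_of_means = sample['ca_fee_amt'].mean() / sample['loan_cap_amt'].mean()

std_of_ratios = row_ratio.std()
ratio_of_stds = sample['ca_fee_amt'].std() / sample['loan_cap_amt'].std()

print(f"mean of ratios: {mean_of_ratios:.5f}, ratio of means: {ratio_of_means:.5f}")
print(f"std of ratios:  {std_of_ratios:.5f}, ratio of stds:  {ratio_of_stds:.5f}")
```

The two means happen to be close on this sample (0.00474 versus 0.00480), but the two spread estimates differ by more than a factor of two, which would shift the ±3×std band noticeably.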

3. Interest Rate Outlier Detection
Feature Name: interest_rate_outlier

Why it helps: RegZ mandates transparency in interest rates charged on loans. If the interest rate charged on a loan is disproportionately high relative to the loan amount, it could point to non-compliance with APR calculation requirements.
Statistical Analysis: Using mean and standard deviation to detect outliers in the interest rate.
Formula:

$$\text{interest\_rate\_outlier} = \frac{\text{net\_interest\_amt}}{\text{loan\_cap\_amt}}$$

Threshold:

Calculate the mean and standard deviation for the interest rate.
Flag interest rates where the ratio exceeds mean + 3×std or falls below mean - 3×std.
python
df['interest_rate_outlier'] = df['net_interest_amt'] / df['loan_cap_amt']
mean_interest_rate = df['interest_rate_outlier'].mean()
std_interest_rate = df['interest_rate_outlier'].std()
df['outlier_interest_rate'] = (df['interest_rate_outlier'] > (mean_interest_rate + 3 * std_interest_rate)) | (df['interest_rate_outlier'] < (mean_interest_rate - 3 * std_interest_rate))
Statistical Justification: An unusually high interest rate compared to the loan cap could indicate RegZ non-compliance, and identifying these outliers will help in flagging potentially problematic transactions.

4. Loan Term to Loan Amount Ratio
Feature Name: loan_term_to_loan_ratio

Why it helps: Longer loan terms generally increase the total interest, fees, and costs paid over the life of the loan. This feature helps determine if the loan term is unusually short or long compared to the loan amount.
Statistical Analysis: The mean and standard deviation of the loan term to loan amount ratio can help detect anomalous terms in relation to the loan value.
Formula:

$$\text{loan\_term\_to\_loan\_ratio} = \frac{\text{loan\_term\_months}}{\text{loan\_cap\_amt}}$$

Threshold:

Calculate the mean and standard deviation of this ratio.
Flag loans with a ratio exceeding mean + 2×std or falling below mean - 2×std.
python
df['loan_term_to_loan_ratio'] = df['loan_term_months'] / df['loan_cap_amt']
mean_loan_term_ratio = df['loan_term_to_loan_ratio'].mean()
std_loan_term_ratio = df['loan_term_to_loan_ratio'].std()
df['outlier_loan_term_ratio'] = (df['loan_term_to_loan_ratio'] > (mean_loan_term_ratio + 2 * std_loan_term_ratio)) | (df['loan_term_to_loan_ratio'] < (mean_loan_term_ratio - 2 * std_loan_term_ratio))
Statistical Justification: Loans whose term is disproportionately long or short relative to the loan amount are worth reviewing for RegZ APR purposes. Longer terms accumulate more total interest, while very short terms may not provide sufficient time to repay.
    
    
    
5. Fee-to-Interest Ratio
Feature Name: fee_to_interest_ratio

Why it helps: Under RegZ, fees and interest rates must be disclosed transparently, and their relationship should be reasonable. A high ratio of fees to interest could indicate a potential attempt to hide excessive fees under the guise of a lower interest rate.
Statistical Analysis: Calculate the mean and standard deviation of the fee-to-interest ratio to identify transactions where fees are disproportionately high compared to interest payments.
Formula:

$$\text{fee\_to\_interest\_ratio} = \frac{\text{ca\_fee\_amt}}{\text{net\_interest\_amt}}$$

Threshold:

Calculate the mean and standard deviation of this ratio.
Flag transactions where the ratio exceeds mean + 3×std or falls below mean - 3×std.
python
df['fee_to_interest_ratio'] = df['ca_fee_amt'] / df['net_interest_amt']
mean_fee_to_interest_ratio = df['fee_to_interest_ratio'].mean()
std_fee_to_interest_ratio = df['fee_to_interest_ratio'].std()
df['outlier_fee_to_interest_ratio'] = (df['fee_to_interest_ratio'] > (mean_fee_to_interest_ratio + 3 * std_fee_to_interest_ratio)) | (df['fee_to_interest_ratio'] < (mean_fee_to_interest_ratio - 3 * std_fee_to_interest_ratio))
Statistical Justification: A significantly high fee-to-interest ratio may suggest that the fees are high relative to the interest, which could be misleading to consumers and may violate RegZ guidelines.

6. APR Deviation from Loan Cap
Feature Name: APR_deviation_from_loan_cap

Why it helps: APR should be proportional to the loan cap. A large deviation could indicate that APR is not properly calculated, or that there are hidden fees or other costs inflating the APR.
Statistical Analysis: Calculate the mean and standard deviation of APR deviations from the loan cap.
Formula:

$$\text{APR\_deviation\_from\_loan\_cap} = \frac{\text{APR}}{\text{loan\_cap\_amt}}$$

Threshold:

Calculate the mean and standard deviation of this ratio.
Flag transactions where the APR deviates significantly from the loan cap using thresholds like mean ± 2×std.
python
df['APR_deviation_from_loan_cap'] = df['APR'] / df['loan_cap_amt']
mean_APR_deviation = df['APR_deviation_from_loan_cap'].mean()
std_APR_deviation = df['APR_deviation_from_loan_cap'].std()
df['outlier_APR_deviation'] = (df['APR_deviation_from_loan_cap'] > (mean_APR_deviation + 2 * std_APR_deviation)) | (df['APR_deviation_from_loan_cap'] < (mean_APR_deviation - 2 * std_APR_deviation))
Statistical Justification: A significant APR deviation from the loan cap can indicate misleading APR calculations or non-compliance with RegZ rules on APR transparency.

7. Balance Transfer Fee Relative to Balance Transfer Amount
Feature Name: balance_transfer_fee_to_balance_ratio

Why it helps: RegZ requires clear disclosure of fees associated with balance transfers. A high balance transfer fee relative to the amount being transferred can lead to a higher-than-expected APR, affecting consumer decision-making.
Statistical Analysis: Calculate the mean and standard deviation of the balance transfer fee to balance transfer amount ratio to identify unusually high fees.
Formula:

$$\text{balance\_transfer\_fee\_to\_balance\_ratio} = \frac{\text{balance\_transfer\_fee}}{\text{balance\_transfer\_amount}}$$

Threshold:

Calculate the mean and standard deviation of this ratio.
Flag transactions where the ratio exceeds mean + 2×std or falls below mean - 2×std.
python
df['balance_transfer_fee_to_balance_ratio'] = df['balance_transfer_fee'] / df['balance_transfer_amount']
mean_balance_transfer_ratio = df['balance_transfer_fee_to_balance_ratio'].mean()
std_balance_transfer_ratio = df['balance_transfer_fee_to_balance_ratio'].std()
df['outlier_balance_transfer_fee'] = (df['balance_transfer_fee_to_balance_ratio'] > (mean_balance_transfer_ratio + 2 * std_balance_transfer_ratio)) | (df['balance_transfer_fee_to_balance_ratio'] < (mean_balance_transfer_ratio - 2 * std_balance_transfer_ratio))
Statistical Justification: High balance transfer fees relative to the amount being transferred might indicate that a consumer is being charged an excessive fee, which may lead to a violation of RegZ requirements.

8. Late Fee Relative to Outstanding Balance
Feature Name: late_fee_to_balance_ratio

Why it helps: Late fees that are disproportionately high compared to the outstanding balance could signal non-compliance with RegZ’s transparency rules. Such fees can increase the APR or lead to unfair charges on consumers.
Statistical Analysis: Calculate the mean and standard deviation of the late fee to outstanding balance ratio to detect unusually high late fees.
Formula:

$$\text{late\_fee\_to\_balance\_ratio} = \frac{\text{late\_fee\_amt}}{\text{outstanding\_balance}}$$

Threshold:

Calculate the mean and standard deviation of this ratio.
Flag transactions where the ratio exceeds mean + 3×std or falls below mean - 3×std.
python
df['late_fee_to_balance_ratio'] = df['late_fee_amt'] / df['outstanding_balance']
mean_late_fee_ratio = df['late_fee_to_balance_ratio'].mean()
std_late_fee_ratio = df['late_fee_to_balance_ratio'].std()
df['outlier_late_fee'] = (df['late_fee_to_balance_ratio'] > (mean_late_fee_ratio + 3 * std_late_fee_ratio)) | (df['late_fee_to_balance_ratio'] < (mean_late_fee_ratio - 3 * std_late_fee_ratio))
Statistical Justification: Late fees that are unusually high relative to the outstanding balance can lead to a high APR, signaling potential non-compliance with RegZ’s fee disclosure and APR calculation requirements.
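
Ratios like this one are undefined when the denominator is zero, for example an account with no outstanding balance. Plain division then produces `inf` or `NaN`, which distorts the mean and standard deviation used for the threshold. One possible guard is sketched below; `safe_ratio` and the sample values are hypothetical and not part of the original data.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide two Series, returning NaN wherever the denominator is zero."""
    return numerator / denominator.replace(0, np.nan)

# Hypothetical accounts; the second one has a zero outstanding balance
accounts = pd.DataFrame({
    'late_fee_amt': [25.0, 35.0, 0.0, 40.0],
    'outstanding_balance': [500.0, 0.0, 1200.0, 800.0],
})
accounts['late_fee_to_balance_ratio'] = safe_ratio(
    accounts['late_fee_amt'], accounts['outstanding_balance']
)

# pandas skips NaN by default, so the zero-balance row does not affect the statistics
ratio_mean = accounts['late_fee_to_balance_ratio'].mean()
ratio_std = accounts['late_fee_to_balance_ratio'].std()
print(accounts)
print(ratio_mean, ratio_std)
```

Rows with a `NaN` ratio compare as False against both bounds, so they are left unflagged and can be reviewed separately.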

9. Fee-to-Loans Paid Ratio
Feature Name: fee_to_loans_paid_ratio

Why it helps: High fees relative to the loan payments may suggest hidden costs that could inflate the APR, potentially violating RegZ’s transparency rules.
Statistical Analysis: We calculate the mean and standard deviation of this ratio to determine reasonable fee-to-loan payment relationships.
Formula:

$$\text{fee\_to\_loans\_paid\_ratio} = \frac{\text{ca\_fee\_amt}}{\text{loan\_paid\_amt}}$$

Threshold:

Calculate the mean and standard deviation of this ratio.
Flag transactions where the ratio exceeds mean + 2×std or falls below mean - 2×std.
python
df['fee_to_loans_paid_ratio'] = df['ca_fee_amt'] / df['loan_paid_amt']
mean_fee_to_loans_paid_ratio = df['fee_to_loans_paid_ratio'].mean()
std_fee_to_loans_paid_ratio = df['fee_to_loans_paid_ratio'].std()
df['outlier_fee_to_loans_paid'] = (df['fee_to_loans_paid_ratio'] > (mean_fee_to_loans_paid_ratio + 2 * std_fee_to_loans_paid_ratio)) | (df['fee_to_loans_paid_ratio'] < (mean_fee_to_loans_paid_ratio - 2 * std_fee_to_loans_paid_ratio))
Statistical Justification: Excessive fees in relation to loans paid could signal a lack of transparency and non-compliance with RegZ, particularly if this results in a misleading APR.    

In [None]:
To create a statistical approach for anomaly detection based on the features we’ve discussed (fee-to-interest ratio, APR deviation from loan cap, balance transfer fee relative to balance transfer amount, etc.), we can use statistical methods like Z-scores, IQR (Interquartile Range), and confidence intervals. Here's how to approach anomaly detection based on the 9 rule-based features we derived:

Step-by-Step Approach:
1. Data Preprocessing:
Before applying any anomaly detection methods, ensure that the data is clean and preprocessed.
Remove or impute null values and correct obvious data-entry errors. Keep genuine extreme values, since those are what the detection is meant to surface.
Normalize features where required, especially if features have different units (e.g., fees, amounts, APR).
2. Z-Score Method for Anomaly Detection:
Z-score is a measure of how many standard deviations a data point is from the mean. If the Z-score is large (above a threshold like 2 or 3), it can be considered an anomaly.
For each of the 9 features:

$$Z = \frac{X - \mu}{\sigma}$$

where:

$X$ is the data point
$\mu$ is the mean of the feature
$\sigma$ is the standard deviation of the feature
A Z-score > 3 or Z-score < -3 indicates an anomaly (depending on the direction of the feature, it could be either too high or too low).

Implementation Example:

python
# Assuming `df` has the columns for all nine calculated features:
features = [
    'transaction_to_loan_cap_ratio', 'loan_fee_to_loan_ratio',
    'interest_rate_outlier', 'loan_term_to_loan_ratio',
    'fee_to_interest_ratio', 'APR_deviation_from_loan_cap',
    'balance_transfer_fee_to_balance_ratio', 'late_fee_to_balance_ratio',
    'fee_to_loans_paid_ratio'
]

# Calculate Z-scores for each feature
for feature in features:
    df[f'{feature}_Z'] = (df[feature] - df[feature].mean()) / df[feature].std()

# Flag anomalies with Z-score > 3 or < -3
for feature in features:
    df[f'{feature}_anomaly'] = (df[f'{feature}_Z'].abs() > 3).astype(int)

# Combine anomaly columns into a final anomaly flag (1 if any feature is flagged)
df['anomaly_flag'] = df[[f'{feature}_anomaly' for feature in features]].any(axis=1).astype(int)
Statistical Justification: The Z-score method helps flag values that deviate significantly from the mean of the feature distribution, which is a clear indication of anomalies in the data.
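
One caveat when applying the ±3 cutoff to small samples: when Z is computed with the sample standard deviation, no point in a sample of size n can have |Z| above (n − 1)/√n. That is about 1.79 for n = 5, so at least 11 rows are needed before any point can exceed 3. The sketch below illustrates this with made-up values.

```python
import math
import pandas as pd

# Hypothetical feature values with one very extreme point
values = pd.Series([1.0, 1.0, 1.0, 1.0, 1000.0])

# Z-scores using the sample standard deviation (ddof=1), the pandas default
z_scores = (values - values.mean()) / values.std()

n = len(values)
max_possible_z = (n - 1) / math.sqrt(n)  # upper bound on |Z| for a sample of size n

print(z_scores.abs().max())
print(max_possible_z)
```

Even the extreme point stays below 3 here, so on a five-row sample like the one in this notebook a lower cutoff or a quartile-based rule is the only way to get any flags.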

3. IQR (Interquartile Range) Method for Anomaly Detection:
The IQR method detects outliers by measuring the spread between the 1st quartile (Q1) and 3rd quartile (Q3) of the data distribution. The bounds are:

$$\text{Lower Bound} = Q_1 - 1.5 \times \text{IQR}$$
$$\text{Upper Bound} = Q_3 + 1.5 \times \text{IQR}$$

where:

$$\text{IQR} = Q_3 - Q_1$$

Anomalies are those values that lie outside the upper or lower bounds.
Implementation Example:

python
for feature in features:
    Q1 = df[feature].quantile(0.25)
    Q3 = df[feature].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    
    # Flag anomalies if values are outside the IQR bounds
    df[f'{feature}_IQR_anomaly'] = df[feature].apply(lambda x: 1 if (x < lower_bound or x > upper_bound) else 0)

# Combine anomaly columns into a final anomaly flag
df['IQR_anomaly_flag'] = df[[f'{feature}_IQR_anomaly' for feature in features]].sum(axis=1)
df['IQR_anomaly_flag'] = df['IQR_anomaly_flag'].apply(lambda x: 1 if x > 0 else 0)
Statistical Justification: The IQR method relies on quartiles rather than the mean and standard deviation, so its bounds are not distorted by the extreme values being searched for. Values outside the bounds are treated as outliers or anomalies.
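
As a worked example, the IQR bounds can be computed for the transaction-to-loan-cap ratios from the sample data. This is a sketch only: `iqr_bounds` is a hypothetical helper, and `Series.between` is used as a vectorized alternative to the row-wise `apply`.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple:
    """Return (lower, upper) bounds at Q1 - k * IQR and Q3 + k * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# transaction_amount / loan_cap_amt for the five sample rows
ratio = pd.Series([0.02, 0.2, 0.2, 0.6, 0.5])

lower, upper = iqr_bounds(ratio)
is_outlier = ~ratio.between(lower, upper)  # True where a value falls outside the bounds

print(lower, upper)
print(is_outlier.tolist())
```

With Q1 = 0.2 and Q3 = 0.5 the bounds come out at roughly -0.25 and 0.95, so none of the five sample ratios is flagged.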

4. Confidence Interval Method for Anomaly Detection:
The Confidence Interval (CI) method defines a band in which a given percentage of the data points is expected to lie. Assuming the feature is approximately normally distributed, about 95% of values fall within mean ± 1.96 × standard deviation.
Implementation Example:

python
for feature in features:
    mean_val = df[feature].mean()
    std_val = df[feature].std()
    lower_bound = mean_val - 1.96 * std_val
    upper_bound = mean_val + 1.96 * std_val
    
    # Flag anomalies if values lie outside the confidence interval
    df[f'{feature}_CI_anomaly'] = df[feature].apply(lambda x: 1 if (x < lower_bound or x > upper_bound) else 0)

# Combine anomaly columns into a final anomaly flag
df['CI_anomaly_flag'] = df[[f'{feature}_CI_anomaly' for feature in features]].sum(axis=1)
df['CI_anomaly_flag'] = df['CI_anomaly_flag'].apply(lambda x: 1 if x > 0 else 0)
Statistical Justification: The band gives the range where we expect the data to lie with a certain confidence (e.g., 95%) if the feature is close to normally distributed. Data points outside this range can be treated as candidate anomalies, keeping in mind that about 5% of ordinary values will also fall outside it by chance.
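
A point of terminology that affects the thresholds: mean ± 1.96 × std describes where individual values are expected to fall, whereas a confidence interval for the mean uses the standard error (std / √n) and is much narrower. The sketch below compares the two on the sample ratios. The 1.96 multiplier is the normal approximation; for a sample this small a t-distribution quantile would be more appropriate.

```python
import math
import pandas as pd

# transaction_amount / loan_cap_amt for the five sample rows
ratio = pd.Series([0.02, 0.2, 0.2, 0.6, 0.5])

mean_val = ratio.mean()
std_val = ratio.std()
n = len(ratio)

# Band meant to cover about 95% of individual values (normality assumed)
value_band = (mean_val - 1.96 * std_val, mean_val + 1.96 * std_val)

# Approximate 95% confidence interval for the mean, based on the standard error
sem = std_val / math.sqrt(n)
mean_ci = (mean_val - 1.96 * sem, mean_val + 1.96 * sem)

print("band for individual values:", value_band)
print("interval for the mean:     ", mean_ci)
```

For flagging individual transactions the wider band is the relevant one. Using the interval for the mean as a cutoff would flag ordinary values.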

5. Rule-Based Anomaly Prediction:
Once you apply these methods (Z-score, IQR, Confidence Interval), you can integrate them into a rule-based system where you define anomaly detection based on a combination of these methods. For instance, if any of the anomaly flags for a feature exceed a certain threshold, the transaction can be flagged as anomalous.

python
# Combine all anomaly flags into a final decision
df['final_anomaly_flag'] = (df['anomaly_flag'] | df['IQR_anomaly_flag'] | df['CI_anomaly_flag']).astype(int)
Final Model:
Anomaly Flag: If any of the above methods flag a value as anomalous (based on thresholds), the transaction is flagged as anomalous.
Confidence Level: The magnitude of the Z-score, or the number of methods (Z-score, IQR, confidence interval) that flag the same transaction, can serve as a rough confidence level for each anomaly flag.
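
One simple way to express such a confidence level is to count how many methods agree. The sketch below uses hypothetical flag values; `anomaly_votes` and `majority_anomaly_flag` are illustrative column names, not part of the original pipeline.

```python
import pandas as pd

# Hypothetical per-method flags for five transactions
flags = pd.DataFrame({
    'anomaly_flag':     [0, 1, 0, 1, 0],  # Z-score method
    'IQR_anomaly_flag': [0, 1, 0, 0, 0],  # IQR method
    'CI_anomaly_flag':  [0, 1, 1, 1, 0],  # mean ± 1.96 * std method
})
method_columns = ['anomaly_flag', 'IQR_anomaly_flag', 'CI_anomaly_flag']

# Number of methods that flag each transaction (0 to 3)
flags['anomaly_votes'] = flags[method_columns].sum(axis=1)

# "Any method" rule, equivalent to OR-ing the three flags
flags['final_anomaly_flag'] = (flags['anomaly_votes'] > 0).astype(int)

# Stricter alternative: require at least two methods to agree
flags['majority_anomaly_flag'] = (flags['anomaly_votes'] >= 2).astype(int)

print(flags)
```

The vote count keeps the result interpretable: a transaction flagged by all three methods is a stronger review candidate than one flagged by a single method.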
6. Evaluation:
Precision, Recall, F1-score: If you have labeled data, evaluate the performance of the anomaly detection using standard classification metrics like precision, recall, and F1-score.
ROC-AUC: If you have labeled data and a continuous anomaly score (for example, the largest absolute Z-score across features), evaluate how well the score ranks anomalies using ROC-AUC.
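
The evaluation step might look like the following with scikit-learn, assuming labeled data is available. The labels, predicted flags, and scores here are made up purely to show the calls.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Hypothetical ground truth (1 = confirmed compliance issue) and model output
y_true = [0, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0]              # binary flags, e.g. final_anomaly_flag
y_score = [0.1, 0.8, 0.9, 0.2, 0.4]   # continuous score, e.g. a scaled max |Z|

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)  # needs scores, not hard 0/1 flags, to be informative

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} auc={auc:.3f}")
```

Because anomalies are usually rare, precision and recall on the flagged class tend to be more informative than overall accuracy.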
Conclusion:
By combining Z-scores, IQR, and confidence intervals, you can create a rule-based anomaly detection system that effectively flags anomalies in transaction data related to RegZ compliance (e.g., APR calculation). These statistical methods offer a solid, interpretable foundation for detecting anomalies that might indicate misleading APR disclosures, excessive fees, or other non-compliance issues, helping to ensure regulatory compliance in financial transactions.