# Marketing Misreporting and Anomaly Detection

This notebook flags five kinds of misreporting and anomalies in the marketing and revenue data:
1. **Revenue Inflation**: Daily marketing revenue more than 20% above finance revenue.
2. **Missing Invoices**: Revenue reported by marketing on days where finance reports none.
3. **Duplicate Attribution**: The same user attributed to more than one revenue event.
4. **Attribution Leakage**: Users who add items to their cart but never convert.
5. **Outlier Spend Days**: Days with spend more than two standard deviations above the mean.

We will then visualize these anomalies in a Plotly dashboard and assign severity flags.

## 1. Load and Prepare Data

In [None]:
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

spend_df = pd.read_csv('../data/marketing_spend.csv')
events_df = pd.read_csv('../data/funnel_events.csv')
revenue_marketing_df = pd.read_csv('../data/revenue_marketing.csv')
revenue_finance_df = pd.read_csv('../data/revenue_finance.csv')

# Convert date columns to datetime
spend_df['date'] = pd.to_datetime(spend_df['date'])
events_df['timestamp'] = pd.to_datetime(events_df['timestamp'])
revenue_marketing_df['date'] = pd.to_datetime(revenue_marketing_df['date'])
revenue_finance_df['date'] = pd.to_datetime(revenue_finance_df['date'])

print('Data Loaded and Prepared.')

## 2. Detect Anomalies

### 2.1 Revenue Inflation (>20% Variance)

In [None]:
mkt_revenue_agg = revenue_marketing_df.groupby('date')['revenue'].sum().reset_index().rename(columns={'revenue': 'mkt_revenue'})
fin_revenue_agg = revenue_finance_df.groupby('date')['revenue'].sum().reset_index().rename(columns={'revenue': 'fin_revenue'})

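# Outer merge keeps days present in only one source; a missing total becomes 0
revenue_comp = pd.merge(mkt_revenue_agg, fin_revenue_agg, on='date', how='outer').fillna(0)
# The ratio is undefined when finance revenue is 0, so leave it as NaN there
# instead of inf; those days are covered by the missing-invoice check
fin_nonzero = revenue_comp['fin_revenue'].where(revenue_comp['fin_revenue'] != 0)
revenue_comp['variance'] = (revenue_comp['mkt_revenue'] - fin_nonzero) / fin_nonzero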

inflated_revenue = revenue_comp[revenue_comp['variance'] > 0.2].copy()
inflated_revenue['anomaly_type'] = 'Revenue Inflation'
inflated_revenue['severity'] = 'High'
inflated_revenue.loc[inflated_revenue['variance'] > 0.5, 'severity'] = 'Critical'

print('Revenue Inflation Detection Complete.')
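
Dividing by finance revenue is undefined on days where finance reports nothing. The following is a minimal sketch, on made-up numbers, of how masking those zeros keeps such days out of the 20% test. The frame `toy` and its values are illustrative assumptions, not taken from the CSV files.

```python
import pandas as pd

# Toy daily totals (made-up numbers for illustration only)
toy = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
    'mkt_revenue': [130.0, 100.0, 80.0],
    'fin_revenue': [100.0, 0.0, 80.0],
})

# Mask zero finance revenue so the division yields NaN instead of inf
safe_fin = toy['fin_revenue'].where(toy['fin_revenue'] != 0)
toy['variance'] = (toy['mkt_revenue'] - safe_fin) / safe_fin

# NaN > 0.2 evaluates to False, so the zero-finance day is not flagged here
toy_inflated = toy[toy['variance'] > 0.2]
print(toy_inflated)
```

Only the first day is flagged in this toy data; the second day has no defined variance and is better treated as a missing invoice.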

### 2.2 Missing Invoices

In [None]:
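# A missing invoice needs marketing revenue on the day, not just zero finance revenue
missing_invoices = revenue_comp[(revenue_comp['fin_revenue'] == 0) & (revenue_comp['mkt_revenue'] > 0)].copy()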
missing_invoices['anomaly_type'] = 'Missing Invoice'
missing_invoices['severity'] = 'High'

print('Missing Invoice Detection Complete.')
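
Filling missing totals with 0 makes "finance has no row for this day" indistinguishable from "finance recorded zero revenue". As a hedged alternative, the sketch below uses the `indicator` option of `DataFrame.merge` to find days that exist only on the marketing side. The frames `mkt` and `fin` are made-up stand-ins for the aggregated daily totals.

```python
import pandas as pd

# Made-up daily totals; finance has no row at all for 2024-01-02
mkt = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-02']),
    'mkt_revenue': [100.0, 50.0],
})
fin = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01']),
    'fin_revenue': [100.0],
})

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
merged = mkt.merge(fin, on='date', how='left', indicator=True)

# Days reported by marketing with no matching finance row
toy_missing = merged[merged['_merge'] == 'left_only']
print(toy_missing[['date', 'mkt_revenue']])
```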

### 2.3 Duplicate Attribution
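**Assumption**: Detecting this requires linking users to revenue. We merge `events_df` with `revenue_marketing_df` on `date`, assuming a revenue event happens on the same day as a user's 'checkout' event. Because the join key is the date alone, each checkout is paired with every marketing revenue row for that day, so treat the result as a coarse signal rather than a confirmed duplicate.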

In [None]:
events_df['date'] = events_df['timestamp'].dt.date
events_df['date'] = pd.to_datetime(events_df['date'])
checkout_events = events_df[events_df['event_type'] == 'checkout']
user_revenue = pd.merge(checkout_events, revenue_marketing_df, on='date')
# Keep every row for users that appear more than once after the join
duplicate_attribution = user_revenue[user_revenue.duplicated(subset='user_id', keep=False)].copy()
duplicate_attribution['anomaly_type'] = 'Duplicate Attribution'
duplicate_attribution['severity'] = 'Medium'

print('Duplicate Attribution Detection Complete.')
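
To see how sensitive the date-only join is, here is a small sketch on made-up data. It assumes, purely for illustration, that the revenue file holds one row per channel per day; the `channel` column is hypothetical and may not exist in the real file.

```python
import pandas as pd

# Two users, one checkout each, on the same day (made-up data)
checkouts = pd.DataFrame({
    'user_id': ['u1', 'u2'],
    'date': pd.to_datetime(['2024-01-01', '2024-01-01']),
})

# Two revenue rows for that day; 'channel' is a hypothetical column
revenue = pd.DataFrame({
    'date': pd.to_datetime(['2024-01-01', '2024-01-01']),
    'channel': ['search', 'social'],
    'revenue': [40.0, 60.0],
})

# Joining on date pairs every checkout with every revenue row of that day
joined = checkouts.merge(revenue, on='date')
rows_per_user = joined.groupby('user_id').size()
print(rows_per_user)
```

Each user ends up with two rows despite a single checkout, so both would be flagged. If the revenue data carries a user or order identifier, joining on that key would likely give a tighter signal.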

### 2.4 Attribution Leakage

In [None]:
converted_users = user_revenue['user_id'].unique()
leaked_users = events_df[~events_df['user_id'].isin(converted_users)]
attribution_leakage = leaked_users[leaked_users['event_type'] == 'add_to_cart'].copy()
# Keep each user's most recent add_to_cart event
attribution_leakage = attribution_leakage.sort_values('timestamp').drop_duplicates(subset=['user_id'], keep='last')
attribution_leakage['anomaly_type'] = 'Attribution Leakage'
attribution_leakage['severity'] = 'Medium'

print('Attribution Leakage Detection Complete.')
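
The leakage rule amounts to a set difference: users who added to cart minus users who converted. The sketch below shows that idea on made-up events and, as a simplifying assumption, treats any 'checkout' event as a conversion. The 'view' event type is hypothetical.

```python
import pandas as pd

# Made-up funnel events; 'view' is a hypothetical event type
events = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u2', 'u3'],
    'event_type': ['add_to_cart', 'checkout', 'add_to_cart', 'view'],
    'timestamp': pd.to_datetime([
        '2024-01-01 10:00', '2024-01-01 10:05',
        '2024-01-01 11:00', '2024-01-01 12:00',
    ]),
})

# Users with at least one add_to_cart, and users with at least one checkout
carted = set(events.loc[events['event_type'] == 'add_to_cart', 'user_id'])
checked_out = set(events.loc[events['event_type'] == 'checkout', 'user_id'])

# Leaked users reached the cart but never checked out
leaked = sorted(carted - checked_out)
print(leaked)
```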

### 2.5 Outlier Spend Days

In [None]:
spend_mean = spend_df['spend'].mean()
spend_std = spend_df['spend'].std()
outlier_threshold = spend_mean + 2 * spend_std

outlier_spend = spend_df[spend_df['spend'] > outlier_threshold].copy()
outlier_spend['anomaly_type'] = 'Outlier Spend'
outlier_spend['severity'] = 'High'

print('Outlier Spend Detection Complete.')
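
Extreme days inflate both the mean and the standard deviation, which raises the threshold they are tested against. One possible alternative, not the method used above, is an interquartile-range fence. A minimal sketch on made-up spend values, using the conventional 1.5 multiplier:

```python
import pandas as pd

# Made-up daily spend with one obvious spike
spend = pd.Series([100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 500.0])

# Quartiles are far less affected by the spike than the mean and std
q1 = spend.quantile(0.25)
q3 = spend.quantile(0.75)
iqr = q3 - q1

# Conventional upper fence: Q3 + 1.5 * IQR
upper_fence = q3 + 1.5 * iqr
iqr_outliers = spend[spend > upper_fence]
print(upper_fence, iqr_outliers.tolist())
```

Whether the fence or the mean-plus-two-standard-deviations rule suits the real data depends on how skewed daily spend is.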

## 3. Anomaly Dashboard

In [None]:
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=(
        'Revenue Inflation (>20%)', 'Outlier Spend Days',
        'Missing Invoices', 'Duplicate Attributions',
        'Attribution Leakage'
    ),
    specs=[[{'type': 'bar'}, {'type': 'scatter'}],
           [{'type': 'table'}, {'type': 'bar'}],
           [{'type': 'table'}, None]]  # None leaves the unused sixth cell empty
)

# Revenue Inflation
fig.add_trace(go.Bar(x=inflated_revenue['date'], y=inflated_revenue['variance'], name='Variance', marker_color='red'), row=1, col=1)

# Outlier Spend
fig.add_trace(go.Scatter(x=spend_df['date'], y=spend_df['spend'], mode='lines', name='Spend'), row=1, col=2)
fig.add_trace(go.Scatter(x=outlier_spend['date'], y=outlier_spend['spend'], mode='markers', name='Outlier', marker=dict(color='red', size=10)), row=1, col=2)

# Missing Invoices
fig.add_trace(
    go.Table(
        header=dict(values=['Date', 'Marketing Revenue', 'Severity']),
        cells=dict(values=[missing_invoices['date'], missing_invoices['mkt_revenue'], missing_invoices['severity']])
    ),
    row=2, col=1
)

# Duplicate Attributions
dup_counts = duplicate_attribution['user_id'].value_counts()
# Cast IDs to strings so the x-axis treats them as categories, not numbers
fig.add_trace(go.Bar(x=dup_counts.index.astype(str), y=dup_counts.values, name='Duplicate Count', marker_color='orange'), row=2, col=2)

# Attribution Leakage
fig.add_trace(
    go.Table(
        header=dict(values=['User ID', 'Last Event Time', 'Severity']),
        cells=dict(values=[attribution_leakage['user_id'], attribution_leakage['timestamp'], attribution_leakage['severity']])
    ),
    row=3, col=1
)

fig.update_layout(height=1200, title_text='Marketing Anomaly Dashboard', showlegend=False)

fig.show()
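
Since every anomaly frame carries `anomaly_type` and `severity` columns, the flags can also be rolled up into a single count table. The sketch below shows one way to do this with `pd.concat` and `pd.crosstab`; the two small frames are made-up stand-ins for the real anomaly frames.

```python
import pandas as pd

# Made-up stand-ins for two of the anomaly frames
inflation = pd.DataFrame({
    'anomaly_type': ['Revenue Inflation', 'Revenue Inflation'],
    'severity': ['High', 'Critical'],
})
spend = pd.DataFrame({
    'anomaly_type': ['Outlier Spend'],
    'severity': ['High'],
})

# Stack the shared columns, then count rows per type and severity
all_anomalies = pd.concat([inflation, spend], ignore_index=True)
summary = pd.crosstab(all_anomalies['anomaly_type'], all_anomalies['severity'])
print(summary)
```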