# Row Count Discrepancy Hypothesis Validation
There is a persistent discrepancy between the number of rows in the local and global model evaluation dataframes (as shown by the `nrow` function calls at the end of `RQ3_analysis-reduced.R`).

My current hypothesis is that this is caused by the way deduplication and binning interact. See [this thread](https://amirlab.slack.com/archives/C07H3A416GZ/p1749924605002809) for more details.
- Essentially, binned local models may have fewer duplicates because bin fidelity is higher when binning is applied to each table separately: each table's bins span only that table's `PAGEID` range, so the bins are narrower and fewer rows fall into the same bin.
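
The bin-width argument can be checked directly. The following is a minimal sketch, assuming equal-width bins from `pd.cut` and a small set of made-up `PAGEID` values for two hypothetical tables; it is not the code in `src/datapipeline.py`. It compares the width of one bin when the edges are computed over all tables with the width when the edges are computed per table.

```python
import pandas as pd

# Hypothetical PAGEID values for two tables with very different ranges.
page_ids = pd.DataFrame({
    'TABNAME': ['Table1'] * 4 + ['Table2'] * 4,
    'PAGEID': [1, 2, 3, 4, 50, 51, 90, 91],
})
num_bins = 10

# Global binning: one set of equal-width edges spanning every table.
_, global_edges = pd.cut(page_ids['PAGEID'], bins=num_bins, retbins=True)
global_width = global_edges[-1] - global_edges[-2]

# Local binning: a separate set of edges per table.
local_widths = {}
for name, group in page_ids.groupby('TABNAME'):
    _, edges = pd.cut(group['PAGEID'], bins=num_bins, retbins=True)
    local_widths[name] = edges[-1] - edges[-2]

print("Global bin width:", round(global_width, 3))
print("Local bin widths:", {k: round(v, 3) for k, v in local_widths.items()})
```

With these values the global bins are about 9 page IDs wide, while the per-table bins are about 0.3 (Table1) and 4.1 (Table2) wide, so neighbouring page IDs are much more likely to share a bin under global binning.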

This notebook hopes to validate that hypothesis by examining the deduplication logic found in `src/datapipeline.py`'s `prep_columns()` function for binned local and global data.

In [11]:
import pandas as pd
import numpy as np

In [12]:
# Construct test DataFrame
data = pd.DataFrame({
    'Start Unix Timestamp': [1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000],
    'TABNAME': ['Table1', 'Table1', 'Table2', 'Table2', 'Table1', 'Table1', 'Table2', 'Table2'],
    'PAGEID': [1, 2, 50, 51, 3, 4, 90, 91]
})
data

   Start Unix Timestamp TABNAME  PAGEID
0                  1000  Table1       1
1                  1000  Table1       2
2                  1000  Table2      50
3                  1000  Table2      51
4                  2000  Table1       3
5                  2000  Table1       4
6                  2000  Table2      90
7                  2000  Table2      91


In [13]:
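def global_method(df, num_bins):
    """Bin PAGEID across all tables at once, then deduplicate."""
    df = df.copy()
    # Bin edges span the full PAGEID range of every table combined
    df['PAGEID_bin'] = pd.cut(df['PAGEID'], bins=num_bins, duplicates='drop')
    # Rows sharing a timestamp, table, and bin count as duplicates
    df_dedup = df.drop_duplicates(subset=['Start Unix Timestamp', 'TABNAME', 'PAGEID_bin'], keep='first')
    return df_dedup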

In [14]:
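def local_method(df, num_bins):
    """Bin PAGEID separately within each table, then deduplicate."""
    result = []
    for _, group in df.groupby('TABNAME'):
        # Copy the group so the column assignment does not touch a slice of df
        group = group.copy()
        # Bin edges span only this table's PAGEID range
        group['PAGEID_bin'] = pd.cut(group['PAGEID'], bins=num_bins, duplicates='drop')
        group_dedup = group.drop_duplicates(subset=['Start Unix Timestamp', 'TABNAME', 'PAGEID_bin'], keep='first')
        result.append(group_dedup)
    return pd.concat(result)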

In [None]:
num_bins = 10

global_result = global_method(data, num_bins)
local_result = local_method(data, num_bins)

# Compare row counts
global_rows = len(global_result)
local_rows = len(local_result)

print("Global method rows:", global_rows)
print("Local method rows:", local_rows)
print("\nGlobal result:")
print(global_result[['Start Unix Timestamp', 'TABNAME', 'PAGEID', 'PAGEID_bin']])
print("\nLocal result:")
print(local_result[['Start Unix Timestamp', 'TABNAME', 'PAGEID', 'PAGEID_bin']])

Global method rows: 4
Local method rows: 6

Global result:
   Start Unix Timestamp TABNAME  PAGEID    PAGEID_bin
0                  1000  Table1       1  (0.91, 10.0]
2                  1000  Table2      50  (46.0, 55.0]
4                  2000  Table1       3  (0.91, 10.0]
6                  2000  Table2      90  (82.0, 91.0]

Local result:
   Start Unix Timestamp TABNAME  PAGEID      PAGEID_bin
0                  1000  Table1       1    (0.997, 1.3]
1                  1000  Table1       2      (1.9, 2.2]
4                  2000  Table1       3      (2.8, 3.1]
5                  2000  Table1       4      (3.7, 4.0]
2                  1000  Table2      50  (49.959, 54.1]
6                  2000  Table2      90    (86.9, 91.0]


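Above we can see that the local method retains more rows than the global method (6 vs. 4). With global binning, the bins span the full `PAGEID` range (1 to 91), so Table1's page IDs 1 and 2 (and likewise 3 and 4) land in the same bin and one row of each pair is dropped as a duplicate. With per-table binning, Table1's bins span only 1 to 4, so each page ID gets its own bin and all four rows survive. In other words, the increased binning fidelity reduces the number of duplicates.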
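
To see exactly which rows account for the difference, the two results can be compared by index. This is a minimal, self-contained sketch on the same toy data; `dedup_binned` is a hypothetical helper written for this illustration and is not the `prep_columns()` function from `src/datapipeline.py`.

```python
import pandas as pd

data = pd.DataFrame({
    'Start Unix Timestamp': [1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000],
    'TABNAME': ['Table1', 'Table1', 'Table2', 'Table2', 'Table1', 'Table1', 'Table2', 'Table2'],
    'PAGEID': [1, 2, 50, 51, 3, 4, 90, 91],
})

def dedup_binned(df, num_bins):
    # Hypothetical helper: bin PAGEID over the rows passed in, then deduplicate.
    df = df.copy()
    df['PAGEID_bin'] = pd.cut(df['PAGEID'], bins=num_bins)
    return df.drop_duplicates(
        subset=['Start Unix Timestamp', 'TABNAME', 'PAGEID_bin'], keep='first'
    )

num_bins = 10

# Global: bin over all rows. Local: bin each table separately, then recombine.
global_kept = dedup_binned(data, num_bins)
local_kept = pd.concat([dedup_binned(g, num_bins) for _, g in data.groupby('TABNAME')])

# Rows that survive local deduplication but are dropped by global deduplication.
only_local = data.loc[local_kept.index.difference(global_kept.index)]
print(only_local)
```

On this data the rows kept only by the local method are the Table1 rows with `PAGEID` 2 and 4, which is the pair of rows that explains the 6 vs. 4 difference.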