# Compare One Day of Data

Created by Mitas Ray on 2024-12-09.

Last edited by Mitas Ray on 2024-12-09.

This notebook investigates three anomalies using a single day of data from two datasets: (1) the old data used for automated training, and (2) the new data to be used for automated training. The anomalies are: (1) the two datasets differ slightly, (2) the same model produces vastly different predictions on the two datasets, and (3) a model trained for just two epochs has very similar accuracy to a model trained for 100 epochs. The procedure is to:
1. download both datasets and isolate the single day
2. download the archived models
3. compare the accuracy results of the archived models on these datasets

To run the notebook, use Python 3.10 (Python 3.12 does not work) and the requirements file for your platform:
- on Linux: use `ficc_python/requirements_py310_linux_jupyter.txt`
- on macOS: use `ficc_python/requirements_py310_mac_jupyter.txt`
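
The version requirement above can be checked at the top of the notebook before any heavy imports run. The following is a minimal sketch; the helper name `is_supported_python` and the constant `REQUIRED_PYTHON` are hypothetical and not part of the repository.

```python
import sys

REQUIRED_PYTHON = (3, 10)    # hypothetical constant: the version this notebook says it needs


def is_supported_python(version_info=sys.version_info) -> bool:
    '''Return True if the major and minor version match the required version.'''
    return tuple(version_info[:2]) == REQUIRED_PYTHON


if not is_supported_python():
    print(f'Warning: running Python {sys.version_info.major}.{sys.version_info.minor}, but this notebook expects Python 3.10')
```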

Note: This notebook **requires at least 50 GB of RAM** on a VM. On a MacBook Pro M1 Max with 32 GB of RAM, it can still run because swap memory allows the system to use additional storage space to supplement the required memory.

Before running, update the following so that the credentials and directories are correct:
- `automated_training_auxiliary_functions.py::get_creds(...)`: set the location of the credentials file
- `automated_training_auxiliary_variables.py::WORKING_DIRECTORY`: set the location of the old working directory

In [None]:
# loads the autoreload extension
%load_ext autoreload
# automatically reloads all imported modules when their source code changes
%autoreload 2

In [2]:
import os

import pandas as pd
from tensorflow import keras


# importing from parent directory: https://stackoverflow.com/questions/714063/importing-modules-from-parent-folder
import sys
sys.path.insert(0, '../../')


from automated_training_auxiliary_variables import WORKING_DIRECTORY, CATEGORICAL_FEATURES
from automated_training_auxiliary_functions import STORAGE_CLIENT, fit_encoders, create_input, load_model, create_summary_of_results
from ficc.utils.gcp_storage_functions import download_data

INFO: Pandarallel will run on 5 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.
Initialized pandarallel with 5 cores
In PRODUCTION mode (to change to TESTING mode, set `TESTING` to `True`); all files and models will be saved and NUM_EPOCHS=100


In [3]:
AUTOMATED_TRAINING_BUCKET = 'automated_training'

In [4]:
MODEL = 'yield_spread_with_similar_trades'

In [5]:
def get_data_for_automated_training_and_isolate_to_single_dates(old_or_new: str, dates: list) -> pd.DataFrame | list[pd.DataFrame]:
    '''Return the trades from the old or new dataset for each date in `dates`, caching each date as a local pickle file. 
    Returns a single DataFrame if `dates` has one item, otherwise a list with one DataFrame per date.'''
    assert old_or_new in ('old', 'new')
    df_list = []
    df_downloaded_from_google_cloud_storage = {}    # ensures that each file is downloaded at most once per call
    for date in dates:
        pickle_file_path = f'{WORKING_DIRECTORY}/files/{old_or_new}_data_{date}.pkl'
        if os.path.exists(pickle_file_path):
            print(f'Loading pickle file from {pickle_file_path}')
            df_list.append(pd.read_pickle(pickle_file_path))
        else:
            print(f'Could not find pickle file in {pickle_file_path}, so creating it now')
            suffix = '' if old_or_new == 'old' else '_v2'
            google_cloud_storage_file_name = f'processed_data_yield_spread_with_similar_trades{suffix}.pkl'
            if google_cloud_storage_file_name not in df_downloaded_from_google_cloud_storage:
                df_downloaded_from_google_cloud_storage[google_cloud_storage_file_name] = download_data(STORAGE_CLIENT, AUTOMATED_TRAINING_BUCKET, google_cloud_storage_file_name)
            df = df_downloaded_from_google_cloud_storage[google_cloud_storage_file_name]

            df_on_date = df[df['trade_date'] == date]    # isolate the trades for a single date
            df_on_date.to_pickle(pickle_file_path)
            df_list.append(df_on_date)
    return df_list if len(df_list) > 1 else df_list[0]
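
The function above follows a load-from-disk-or-create pattern: reuse a local pickle file if it exists, otherwise do the expensive work and cache the result. The sketch below isolates that pattern using only the standard library. The names `load_or_create` and `fake_download` are hypothetical stand-ins, and a small dictionary takes the place of the real DataFrame.

```python
import os
import pickle
import tempfile
from typing import Any, Callable


def load_or_create(cache_path: str, create: Callable[[], Any]) -> Any:
    '''Return the object pickled at `cache_path`; if the file is absent, call `create` and cache its result.'''
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as file:
            return pickle.load(file)
    obj = create()
    os.makedirs(os.path.dirname(cache_path) or '.', exist_ok=True)    # create the cache directory if needed
    with open(cache_path, 'wb') as file:
        pickle.dump(obj, file)
    return obj


download_calls = []


def fake_download() -> dict:
    '''Hypothetical stand-in for the expensive download from cloud storage.'''
    download_calls.append(1)
    return {'trade_date': '2024-12-06', 'num_rows': 3}


with tempfile.TemporaryDirectory() as temporary_directory:
    path = os.path.join(temporary_directory, 'files', 'old_data_2024-12-06.pkl')
    first = load_or_create(path, fake_download)     # file is missing, so `fake_download` runs
    second = load_or_create(path, fake_download)    # file exists, so it is read from disk

print(f'Number of downloads: {len(download_calls)}')
```

Unlike this sketch, the notebook function does not create the `files` directory itself, so that directory is assumed to exist under `WORKING_DIRECTORY`.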

In [6]:
old_df_on_2024_12_06 = get_data_for_automated_training_and_isolate_to_single_dates('old', ['2024-12-06'])
new_df_on_2024_12_06 = get_data_for_automated_training_and_isolate_to_single_dates('new', ['2024-12-06'])

Loading pickle file from /Users/mitas/ficc/ficc_python/notebooks/compare_datasets/files/old_data_2024-12-06.pkl
Loading pickle file from /Users/mitas/ficc/ficc_python/notebooks/compare_datasets/files/new_data_2024-12-06.pkl


### Anomaly 1: different data in the two datasets
Check which RTRS control numbers differ between the two datasets.  
Conclusion: there will always be RTRS control numbers that are present in one dataset but not in the other, in both directions, because we exclude trades based on certain conditions in the reference data (see `automated_training_auxiliary_functions.py::get_data_query(...)` and `automated_training_auxiliary_variables.py::QUERY_CONDITIONS`). The discrepancy arises because the two data providers define features like `coupon_type` and `capital_type` differently and may report default events differently. Hence, a given trade may meet an exclusion criterion under one set of reference data but not under the other.
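
To illustrate the conclusion, the toy example below applies one inclusion rule to two sets of reference data for the same three trades. All values, the control numbers, and the rule in `is_included` are hypothetical; the real conditions live in `QUERY_CONDITIONS` and are not reproduced here.

```python
# hypothetical reference data for the same three trades, as reported by two providers
old_reference_data = {
    1001: {'coupon_type': 'fixed', 'in_default': False},
    1002: {'coupon_type': 'fixed', 'in_default': False},
    1003: {'coupon_type': 'fixed', 'in_default': True},
}
new_reference_data = {
    1001: {'coupon_type': 'fixed', 'in_default': False},
    1002: {'coupon_type': 'variable', 'in_default': False},    # providers disagree on the coupon type
    1003: {'coupon_type': 'fixed', 'in_default': False},       # providers disagree on the default event
}


def is_included(record: dict) -> bool:
    '''Hypothetical rule: keep only fixed coupon trades that are not in default.'''
    return record['coupon_type'] == 'fixed' and not record['in_default']


old_kept = {number for number, record in old_reference_data.items() if is_included(record)}
new_kept = {number for number, record in new_reference_data.items() if is_included(record)}

print(f'In the old data but not in the new data: {old_kept - new_kept}')
print(f'In the new data but not in the old data: {new_kept - old_kept}')
```

The same rule yields a different set of trades for each provider, and the difference goes in both directions.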

In [7]:
print(f'Number of items in the old df: {len(old_df_on_2024_12_06)}')
print(f'Number of items in the new df: {len(new_df_on_2024_12_06)}')

Number of items in the old df: 58248
Number of items in the new df: 59014


In [8]:
old_df_on_2024_12_06_rtrs_control_numbers = set(old_df_on_2024_12_06['rtrs_control_number'].tolist())
new_df_on_2024_12_06_rtrs_control_numbers = set(new_df_on_2024_12_06['rtrs_control_number'].tolist())

In [9]:
print(f'RTRS control numbers in the old df but not in the new df: {old_df_on_2024_12_06_rtrs_control_numbers - new_df_on_2024_12_06_rtrs_control_numbers}')
print(f'RTRS control numbers in the new df but not in the old df: {new_df_on_2024_12_06_rtrs_control_numbers - old_df_on_2024_12_06_rtrs_control_numbers}')

RTRS control numbers in the old df but not in the new df: {2024120609830400, 2024120601898500, 2024120608315400, 2024120609774600, 2024120611621900, 2024120602891300, 2024120614790700, 2024120609832500, 2024120602015800, 2024120614448700, 2024120606218300, 2024120613168700, 2024120607156800, 2024120606508100, 2024120614567500, 2024120604229200, 2024120600453200, 2024120612549200, 2024120614575700, 2024120601694300, 2024120613969500, 2024120600520800, 2024120602995300, 2024120608106600, 2024120606216300, 2024120610047600, 2024120600519800, 2024120607892600, 2024120604155000, 2024120612222600, 2024120607157900, 2024120605024400, 2024120613169300, 2024120602032800, 2024120609204900, 2024120602164900, 2024120603248300, 2024120608321200, 2024120611354800, 2024120608107700, 2024120614443700, 2024120614575800, 2024120607804600, 2024120615463100, 2024120601694400, 2024120600520900, 2024120604156100, 2024120604497100, 2024120607189200, 2024120612450000, 2024120607687400, 2024120602016500, 20241
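
Printing the full sets produces very long output, so a count of each difference can be easier to read. The sketch below uses a hypothetical helper, `summarize_difference`, on toy values. It then derives the implied counts from the row counts reported in this notebook (58,248 old rows, 59,014 new rows, and 58,146 rows in each dataset once Anomaly 2 restricts to shared control numbers), assuming each RTRS control number appears only once per dataset.

```python
def summarize_difference(old_numbers, new_numbers) -> dict:
    '''Count the items that are only in the old collection, only in the new collection, and in both.'''
    old_numbers, new_numbers = set(old_numbers), set(new_numbers)
    return {'old_only': len(old_numbers - new_numbers),
            'new_only': len(new_numbers - old_numbers),
            'shared': len(old_numbers & new_numbers)}


summary = summarize_difference([1, 2, 3, 4], [3, 4, 5])    # toy values, not real RTRS control numbers
print(summary)

# row counts reported in this notebook; the subtraction assumes one row per RTRS control number
num_old_rows, num_new_rows, num_shared_rows = 58248, 59014, 58146
implied_old_only = num_old_rows - num_shared_rows
implied_new_only = num_new_rows - num_shared_rows
print(f'Implied number of RTRS control numbers only in the old df: {implied_old_only}')
print(f'Implied number of RTRS control numbers only in the new df: {implied_new_only}')
```

In the notebook itself, the helper could be called with `old_df_on_2024_12_06_rtrs_control_numbers` and `new_df_on_2024_12_06_rtrs_control_numbers`.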

### Anomaly 2: same model has vastly different predictions
Evaluate the same model on both datasets, first on all rows and then only on rows whose RTRS control number appears in both datasets. Restricting to the shared RTRS control numbers lets us be confident that the discrepancy in predictions is not caused by a few outlier CUSIPs that appear in only one dataset.

In [10]:
similar_trades_model_2024_12_06, _ = load_model('2024-12-06', 'yield_spread_with_similar_trades')

BEGIN load_model
Attempting to load model from gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-06
Model failed to load from gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-06 with exception: Error executing an HTTP request: HTTP response code 404 with body '<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-06</Details></Error>'
	 when reading gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-06
Attempting to load model from gs://automated_training/similar-trades-model-2024-12-06




Model loaded from gs://automated_training/similar-trades-model-2024-12-06
END load_model. Execution time: 0:00:43.779


In [11]:
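def create_summary_of_results_for_model(df: pd.DataFrame, model) -> pd.DataFrame:
    '''Return the summary of results (mean absolute error and trade count per segment) for `model` on `df`.'''
    encoders, _ = fit_encoders(df, CATEGORICAL_FEATURES, MODEL)    # note: the encoders are fit on `df` itself
    return create_summary_of_results(model, df, *create_input(df, encoders, MODEL), print_results=False)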

In [12]:
create_summary_of_results_for_model(old_df_on_2024_12_06, similar_trades_model_2024_12_06)

BEGIN create_input
END create_input. Execution time: 0:00:00.167




| Segment | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 10.895 | 58248 |
| Dealer-Dealer | 10.961 | 21730 |
| Bid Side / Dealer-Purchase | 10.782 | 16752 |
| Offered Side / Dealer-Sell | 10.918 | 19766 |
| AAA | 9.897 | 8953 |
| Investment Grade | 10.445 | 47445 |
| Trade size >= 100k | 9.654 | 13024 |
| Last trade <= 7 days | 9.553 | 40836 |
| 7 days < Last trade <= 14 days | 11.627 | 4164 |
| 14 days < Last trade <= 28 days | 13.343 | 5090 |


In [13]:
create_summary_of_results_for_model(new_df_on_2024_12_06, similar_trades_model_2024_12_06)

BEGIN create_input
END create_input. Execution time: 0:00:00.168


| Segment | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 14.761 | 59014 |
| Dealer-Dealer | 14.974 | 21781 |
| Bid Side / Dealer-Purchase | 14.038 | 16769 |
| Offered Side / Dealer-Sell | 15.126 | 20464 |
| AAA | 13.12 | 8915 |
| Investment Grade | 13.929 | 47749 |
| Trade size >= 100k | 15.171 | 13723 |
| Last trade <= 7 days | 13.273 | 41617 |
| 7 days < Last trade <= 14 days | 14.818 | 4160 |
| 14 days < Last trade <= 28 days | 16.445 | 5084 |


Select only rows that have an RTRS control number in both datasets.

In [17]:
rtrs_control_numbers_in_both_old_data_and_new_data = old_df_on_2024_12_06_rtrs_control_numbers & new_df_on_2024_12_06_rtrs_control_numbers
old_df_on_2024_12_06_same_rtrs_control_numbers = old_df_on_2024_12_06[old_df_on_2024_12_06['rtrs_control_number'].isin(rtrs_control_numbers_in_both_old_data_and_new_data)]
new_df_on_2024_12_06_same_rtrs_control_numbers = new_df_on_2024_12_06[new_df_on_2024_12_06['rtrs_control_number'].isin(rtrs_control_numbers_in_both_old_data_and_new_data)]

In [18]:
create_summary_of_results_for_model(old_df_on_2024_12_06_same_rtrs_control_numbers, similar_trades_model_2024_12_06)

BEGIN create_input
END create_input. Execution time: 0:00:00.173


| Segment | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 10.89 | 58146 |
| Dealer-Dealer | 10.96 | 21692 |
| Bid Side / Dealer-Purchase | 10.777 | 16728 |
| Offered Side / Dealer-Sell | 10.91 | 19726 |
| AAA | 9.9 | 8947 |
| Investment Grade | 10.436 | 47359 |
| Trade size >= 100k | 9.647 | 12996 |
| Last trade <= 7 days | 9.546 | 40756 |
| 7 days < Last trade <= 14 days | 11.619 | 4159 |
| 14 days < Last trade <= 28 days | 13.323 | 5083 |


In [19]:
create_summary_of_results_for_model(new_df_on_2024_12_06_same_rtrs_control_numbers, similar_trades_model_2024_12_06)

BEGIN create_input
END create_input. Execution time: 0:00:00.162


| Segment | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 14.514 | 58146 |
| Dealer-Dealer | 14.881 | 21692 |
| Bid Side / Dealer-Purchase | 14.021 | 16728 |
| Offered Side / Dealer-Sell | 14.529 | 19726 |
| AAA | 13.095 | 8908 |
| Investment Grade | 13.712 | 47229 |
| Trade size >= 100k | 14.243 | 12996 |
| Last trade <= 7 days | 12.933 | 40756 |
| 7 days < Last trade <= 14 days | 14.817 | 4159 |
| 14 days < Last trade <= 28 days | 16.446 | 5083 |
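
The gap persists after restricting both datasets to the same RTRS control numbers. As a rough way to see where it is largest, the sketch below subtracts the mean absolute errors in the two tables above, segment by segment. The numbers are copied by hand from those outputs; the variable names are illustrative only.

```python
# mean absolute errors copied from the two tables above (same model, same RTRS control numbers)
mae_on_old_data = {'Entire set': 10.89, 'Dealer-Dealer': 10.96, 'Bid Side / Dealer-Purchase': 10.777,
                   'Offered Side / Dealer-Sell': 10.91, 'AAA': 9.9, 'Investment Grade': 10.436,
                   'Trade size >= 100k': 9.647, 'Last trade <= 7 days': 9.546,
                   '7 days < Last trade <= 14 days': 11.619, '14 days < Last trade <= 28 days': 13.323}
mae_on_new_data = {'Entire set': 14.514, 'Dealer-Dealer': 14.881, 'Bid Side / Dealer-Purchase': 14.021,
                   'Offered Side / Dealer-Sell': 14.529, 'AAA': 13.095, 'Investment Grade': 13.712,
                   'Trade size >= 100k': 14.243, 'Last trade <= 7 days': 12.933,
                   '7 days < Last trade <= 14 days': 14.817, '14 days < Last trade <= 28 days': 16.446}

# note: the AAA and Investment Grade trade counts differ between the two tables, so those two
# segments do not contain exactly the same trades in the old data and the new data
gaps = {segment: round(mae_on_new_data[segment] - mae_on_old_data[segment], 3) for segment in mae_on_old_data}
for segment, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f'{segment:<35} +{gap:.3f}')

segment_with_largest_gap = max(gaps, key=gaps.get)
```

Every segment is worse on the new data by more than 3 in mean absolute error, which suggests the discrepancy is broad rather than confined to one type of trade.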


### Anomaly 3: test model has similar accuracy to production model
The v2 yield spread with similar trades model trained on 2024-12-06 used only 2 epochs. The cells below evaluate the production model on the old data and the v2 model on the new data.

In [28]:
similar_trades_model_2024_12_09, _ = load_model('2024-12-09', 'yield_spread_with_similar_trades')
similar_trades_model_v2_2024_12_09 = keras.models.load_model(f'gs://{AUTOMATED_TRAINING_BUCKET}/similar-trades-v2-model-2024-12-09')    # path of the form: gs://<bucket>/<model>

BEGIN load_model
Attempting to load model from gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-09
Model failed to load from gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-09 with exception: Error executing an HTTP request: HTTP response code 404 with body '<?xml version='1.0' encoding='UTF-8'?><Error><Code>NoSuchKey</Code><Message>The specified key does not exist.</Message><Details>No such object: automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-09</Details></Error>'
	 when reading gs://automated_training/yield_spread_with_similar_trades_v2_model/similar-trades-model-2024-12-09
Attempting to load model from gs://automated_training/similar-trades-model-2024-12-09




Model loaded from gs://automated_training/similar-trades-model-2024-12-09
END load_model. Execution time: 0:00:39.141




In [34]:
create_summary_of_results_for_model(old_df_on_2024_12_06, similar_trades_model_2024_12_09)

BEGIN create_input
END create_input. Execution time: 0:00:00.158




| Segment | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 10.732 | 58248 |
| Dealer-Dealer | 10.913 | 21730 |
| Bid Side / Dealer-Purchase | 10.587 | 16752 |
| Offered Side / Dealer-Sell | 10.655 | 19766 |
| AAA | 9.588 | 8953 |
| Investment Grade | 10.334 | 47445 |
| Trade size >= 100k | 9.392 | 13024 |
| Last trade <= 7 days | 9.481 | 40836 |
| 7 days < Last trade <= 14 days | 11.335 | 4164 |
| 14 days < Last trade <= 28 days | 13.279 | 5090 |


In [35]:
create_summary_of_results_for_model(new_df_on_2024_12_06, similar_trades_model_v2_2024_12_09)

BEGIN create_input
END create_input. Execution time: 0:00:00.167




| Segment | Mean Absolute Error | Trade Count |
|---|---|---|
| Entire set | 11.264 | 59014 |
| Dealer-Dealer | 11.3 | 21781 |
| Bid Side / Dealer-Purchase | 11.415 | 16769 |
| Offered Side / Dealer-Sell | 11.101 | 20464 |
| AAA | 10.189 | 8915 |
| Investment Grade | 10.68 | 47749 |
| Trade size >= 100k | 9.303 | 13723 |
| Last trade <= 7 days | 9.853 | 41617 |
| 7 days < Last trade <= 14 days | 11.893 | 4160 |
| 14 days < Last trade <= 28 days | 13.454 | 5084 |
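
To put "similar accuracy" into numbers, the sketch below compares the `Entire set` rows of the two tables above. It assumes, as the notebook states, that the production model was trained for 100 epochs and the v2 model for 2 epochs. The Anomaly 2 figures are included for contrast; all values are copied by hand from the outputs in this notebook.

```python
# 'Entire set' mean absolute errors copied from the outputs in this notebook
mae_production_model_on_old_data = 10.732    # production model dated 2024-12-09, evaluated on the old data
mae_v2_model_on_new_data = 11.264            # v2 model dated 2024-12-09, evaluated on the new data

absolute_gap = round(mae_v2_model_on_new_data - mae_production_model_on_old_data, 3)
relative_gap_percent = round(100 * absolute_gap / mae_production_model_on_old_data, 2)
print(f'Anomaly 3 gap: {absolute_gap} ({relative_gap_percent}% of the production model error)')

# for contrast: in Anomaly 2, one model (dated 2024-12-06) was evaluated on both full datasets
anomaly_2_gap = round(14.761 - 10.895, 3)
print(f'Anomaly 2 gap: {anomaly_2_gap}')
```

The two models differ by roughly half a unit of mean absolute error, which is much smaller than the gap the same model shows between the old and new data in Anomaly 2.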
