# Data Merging and Filtering for PrimoGPT Training

This notebook merges and filters GPT-generated features from multiple stock datasets. It reads JSON files containing GPT-4 responses and prepares them for training the PrimoGPT model.

## Key Functions
1. Loading and merging multiple JSON files containing GPT responses
2. Parsing and validating GPT-generated feature values
3. Filtering out invalid or zero-value samples
4. Saving the cleaned, merged dataset

## Process Flow
1. Loads all JSON files from the data directory
2. Parses GPT responses and validates feature values
3. Filters out samples where all features are zero or invalid
4. Saves the merged and filtered dataset for model training
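
As a rough illustration of the loading and merging step, the sketch below combines two small JSON files written to a temporary directory. It assumes each file holds a JSON array of sample records with a `response` field; the file names (`AAPL.json`, `MSFT.json`), the feature name `f1`, and the helper `merge_json_files` are hypothetical and used only for illustration.

```python
import json
import tempfile
from pathlib import Path


def merge_json_files(directory):
    """Concatenate the JSON arrays stored in every *.json file of `directory`."""
    merged = []
    # sorted() keeps the merge order deterministic across platforms
    for path in sorted(Path(directory).glob("*.json")):
        with open(path, "r", encoding="utf-8") as f:
            merged.extend(json.load(f))
    return merged


# Hypothetical per-stock files, each holding a list of sample records
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "AAPL.json").write_text(
        json.dumps([{"response": '{"f1": 1}'}]), encoding="utf-8"
    )
    (tmp / "MSFT.json").write_text(
        json.dumps([{"response": '{"f1": 0}'}, {"response": '{"f1": -1}'}]),
        encoding="utf-8",
    )
    merged = merge_json_files(tmp)

print(len(merged))  # number of records across both files
```

Merging here is plain list concatenation, so records that appear in more than one file are not deduplicated.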

## Data Quality Checks
- Validates JSON response format
- Requires at least one non-zero feature value per sample
- Converts all feature values to consistent string format
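
The checks above could be sketched roughly as follows. This is a minimal, standalone illustration rather than the notebook's exact code: the helper names `parse_features` and `has_nonzero_feature` and the feature names in the sample responses are hypothetical, and treating non-integer values as invalid is an assumption.

```python
import json
import re


def parse_features(response_str):
    """Return the feature dict embedded in a response, or None if no valid JSON is found."""
    # re.DOTALL lets the match span line breaks inside the JSON object
    match = re.search(r"\{.*\}", response_str, re.DOTALL)
    if not match:
        return None
    try:
        parsed = json.loads(match.group())
    except json.JSONDecodeError:
        return None
    # Normalise every value to a string; missing values become '0'
    return {k: "0" if v is None else str(v) for k, v in parsed.items()}


def has_nonzero_feature(features):
    """True if at least one feature value parses as a non-zero integer."""
    for value in features.values():
        try:
            if int(value) != 0:
                return True
        except ValueError:
            continue  # assumption: non-integer values count as invalid
    return False


# Hypothetical responses; the feature names are made up for illustration
good = 'Answer:\n{"news_sentiment": 2,\n "trend": null}'
all_zero = '{"news_sentiment": 0, "trend": 0}'
broken = "no JSON here"

print(parse_features(good))
print(has_nonzero_feature(parse_features(good)))
print(has_nonzero_feature(parse_features(all_zero)))
print(parse_features(broken))
```

A sample would be kept only when its response parses and `has_nonzero_feature` returns `True`.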

In [None]:
import json
import os
from pathlib import Path
import re

In [None]:
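# Function to load and combine all JSON files from a directory
def load_json_files(directory):
    data = []
    # Sort the file names so the merge order is deterministic
    for filename in sorted(os.listdir(directory)):
        if filename.endswith('.json'):
            with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
                # Each file is expected to hold a JSON array of samples
                data.extend(json.load(file))
    return data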

# Function to parse GPT response and extract feature values
def parse_response(response_str):
    # re.DOTALL lets the match span line breaks, so multi-line JSON is found too
    match = re.search(r'\{.*\}', response_str, re.DOTALL)
    if match:
        try:
            response_dict = json.loads(match.group())
            # Convert all values to strings and replace None with '0'
            return {k: str(v) if v is not None else '0' for k, v in response_dict.items()}
        except json.JSONDecodeError:
            # The braces did not enclose valid JSON
            return None
    return None

# Function to filter samples based on feature values
def filter_samples(data):
    filtered_data = []
    removed_samples = []

    def is_nonzero(value):
        # Values arrive as strings; anything that is not an integer counts as invalid
        try:
            return int(value) != 0
        except (TypeError, ValueError):
            return False

    for sample in data:
        response = parse_response(sample['response'])
        # Keep sample if at least one feature has a non-zero value
        if response and any(is_nonzero(value) for value in response.values()):
            filtered_data.append(sample)
        else:
            removed_samples.append(sample)

    return filtered_data, removed_samples

In [None]:
# Load all JSON files from the data directory
data_directory = Path('data')
all_samples = load_json_files(data_directory)

# Filter samples
filtered_samples, removed_samples = filter_samples(all_samples)

In [None]:
# Print results
print(f"Total samples before filtering: {len(all_samples)}")
print(f"Total samples after filtering: {len(filtered_samples)}")
print(f"Number of samples removed: {len(removed_samples)}")

In [None]:
# Save filtered samples to a new JSON file
# Note: the output is written to the same directory that load_json_files reads,
# so re-running the load cell will also pick up filtered_samples.json
output_file = data_directory / 'filtered_samples.json'
with open(output_file, 'w') as f:
    json.dump(filtered_samples, f, indent=2)

print(f"\nFiltered samples saved to {output_file}")