# Data Collection and Feature Generation for PrimoGPT Training

This notebook collects and processes the stock data used to train the PrimoGPT model. The process involves:

1. Downloading historical stock data, news, and press releases using Finnhub and Yahoo Finance APIs
2. Processing this data through GPT-4o to generate labeled features for training
3. Saving both the raw and the processed data as CSV files

## Key Parameters
- `is_for_train=True`: Enables a special prompt template that includes future price information
- `custom_gpt=False`: Uses GPT-4o instead of a custom PrimoGPT model for feature generation (no custom model is available yet)

## Process Flow
1. Downloads stock price data and calculates returns
2. Fetches relevant news and press releases
3. Processes each day's data through GPT-4o to generate features such as:
   - News relevance (0-2)
   - Sentiment (-1 to 1)
   - Price impact potential (-3 to 3)
   - Trend direction (-1 to 1)
   - Earnings impact (-2 to 2)
   - Investor confidence (-3 to 3)
   - Risk profile change (-2 to 2)
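
The ranges above can double as a sanity check on whatever the model returns. The following is a minimal sketch, not part of the `primogpt` package: the dictionary keys are hypothetical names chosen for illustration and may not match the columns the pipeline actually writes, and `find_out_of_range` is an assumed helper.

```python
# Allowed ranges copied from the feature list above. The key names are
# hypothetical and may differ from the column names the pipeline produces.
FEATURE_RANGES = {
    "news_relevance": (0, 2),
    "sentiment": (-1, 1),
    "price_impact_potential": (-3, 3),
    "trend_direction": (-1, 1),
    "earnings_impact": (-2, 2),
    "investor_confidence": (-3, 3),
    "risk_profile_change": (-2, 2),
}


def find_out_of_range(features):
    """Return {name: value} for each feature that is missing, non-numeric, or out of range."""
    problems = {}
    for name, (low, high) in FEATURE_RANGES.items():
        value = features.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        # NaN fails the chained comparison, so it is reported as a problem too.
        if not is_number or not low <= value <= high:
            problems[name] = value
    return problems


sample = {
    "news_relevance": 2,
    "sentiment": 1,
    "price_impact_potential": 5,  # outside the -3..3 range
    "trend_direction": 0,
    "earnings_impact": -1,
    "investor_confidence": 3,
    # "risk_profile_change" is intentionally left out
}
print(find_out_of_range(sample))
# → {'price_impact_potential': 5, 'risk_profile_change': None}
```

An empty result means every feature is present and within its documented range.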

## Related Files
- `primogpt/prepare_data.py`: Handles data collection and preprocessing
- `primogpt/create_prompt.py`: Manages prompt engineering and GPT interactions

In [None]:
# Import required modules and set up paths
import sys
sys.path.append('../../')

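import json
import os

# pandas is used below as `pd`; import it explicitly rather than
# relying on the wildcard imports to provide it
import pandas as pd

from primogpt.create_prompt import *
from primogpt.prepare_data import *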

In [None]:
# Define stock symbol and date range for data collection
stock_symbol = "AMD"
start_date = "2021-09-01"
end_date = "2024-07-31"

# Create directory for storing data
data_dir = f"data/{stock_symbol}_{start_date}_{end_date}"
os.makedirs(data_dir, exist_ok=True)

In [None]:
# Downloads and processes raw stock data
prepare_data_for_symbol(stock_symbol, data_dir, start_date, end_date)
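
The notebook does not show how returns are computed inside `prepare_data_for_symbol`. As a rough illustration only, the sketch below assumes simple period-over-period returns on closing prices. `simple_returns` is a hypothetical helper and the prices are made up.

```python
def simple_returns(closes):
    """Return the simple return between each consecutive pair of prices."""
    # Pair every price with its successor: (p0, p1), (p1, p2), ...
    return [(curr - prev) / prev for prev, curr in zip(closes, closes[1:])]


closes = [100.0, 110.0, 99.0]  # made-up closing prices
returns = simple_returns(closes)
print([round(r, 4) for r in returns])
```

The result has one fewer element than the input, because the first price has no predecessor to compare against.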

In [None]:
# Define output CSV filename
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)

# Load and display raw data for verification
df = pd.read_csv(csv_file_path)
df.head(50)

In [None]:
# Display sample news content to verify data quality
news_content = df.loc[1, 'News']
news_content_json = json.loads(news_content) 

print("News:")
for news_item in news_content_json:
    print(f"Date: {news_item['date']}, Headline: {news_item['headline']}, Summary: {news_item['summary']}\n")

In [None]:
# Display sample press releases to verify data quality
press_releases = df.loc[1, 'PressReleases']
press_releases_json = json.loads(press_releases)

print("Press Releases:")
for release in press_releases_json:
    print(f"Date: {release['date']}, Headline: {release['headline']}, Description: {release['description']}\n")
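
The two cells above call `json.loads` directly on a cell value. If a row has no news or press releases, pandas loads the empty CSV cell as a float `NaN`, and `json.loads` raises a `TypeError` on it. Whether such rows occur in this dataset is an assumption. A defensive helper might look like the sketch below, where `parse_json_cell` is a hypothetical name, not a `primogpt` function.

```python
import json


def parse_json_cell(cell):
    """Return the list decoded from a JSON cell, or [] if the cell is empty or malformed."""
    # pandas represents an empty CSV cell as float NaN rather than a string.
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        parsed = json.loads(cell)
    except json.JSONDecodeError:
        return []
    # Only a JSON array is treated as a valid list of items.
    return parsed if isinstance(parsed, list) else []


print(parse_json_cell('[{"date": "2021-09-02", "headline": "Example headline"}]'))
print(parse_json_cell(float("nan")))
```

With such a helper, `parse_json_cell(df.loc[1, 'News'])` could replace the direct `json.loads` call, and the printing loop would simply do nothing for rows without items.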

In [None]:
# Generates features using GPT-4o and saves results to CSV
results = process_stock_data(stock_symbol, data_dir, start_date, end_date, is_for_train=True, custom_gpt=False)

In [None]:
# Load and display processed data with GPT-generated features
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}_gpt.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)
df = pd.read_csv(csv_file_path)
df.head(50)