# Data Collection and Feature Generation for PrimoGPT Training

This notebook collects and processes the stock data used to train the PrimoGPT model. The process involves:

1. Downloading historical stock prices, news, and press releases using the Finnhub and Yahoo Finance APIs
2. Processing this data through GPT-4o to generate labeled features for training
3. Saving both the raw and the processed data as CSV files

## Key Parameters
- `is_for_train=True`: Enables a special prompt template that includes future price information
- `custom_gpt=False`: Uses GPT-4o instead of a custom PrimoGPT model for feature generation (no custom model exists at this point)

## Process Flow
1. Downloads stock price data and calculates returns
2. Fetches relevant news and press releases
3. Processes each day's data through GPT-4o to generate features such as:
   - News relevance (0-2)
   - Sentiment (-1 to 1)
   - Price impact potential (-3 to 3)
   - Trend direction (-1 to 1)
   - Earnings impact (-2 to 2)
   - Investor confidence (-3 to 3)
   - Risk profile change (-2 to 2)
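
The ranges listed above can be turned into a simple sanity check on the generated features. The following is a minimal sketch, assuming the feature names match the column headers of the processed CSV; `FEATURE_RANGES` and `out_of_range_features` are hypothetical names introduced here for illustration, not part of the `primogpt` package.

```python
# Inclusive ranges for the GPT-generated features, as documented above.
# Assumption: keys match the column names of the processed CSV.
FEATURE_RANGES = {
    "News Relevance": (0, 2),
    "Sentiment": (-1, 1),
    "Price Impact Potential": (-3, 3),
    "Trend Direction": (-1, 1),
    "Earnings Impact": (-2, 2),
    "Investor Confidence": (-3, 3),
    "Risk Profile Change": (-2, 2),
}


def out_of_range_features(row):
    """Return {feature: value} for each feature that is missing or out of range."""
    problems = {}
    for name, (low, high) in FEATURE_RANGES.items():
        value = row.get(name)
        try:
            in_range = low <= value <= high
        except TypeError:  # None or a non-numeric value
            in_range = False
        if not in_range:
            problems[name] = value
    return problems


# Example row with plausible values for one trading day.
sample = {
    "News Relevance": 2,
    "Sentiment": 1,
    "Price Impact Potential": 1,
    "Trend Direction": 1,
    "Earnings Impact": 1,
    "Investor Confidence": 2,
    "Risk Profile Change": 0,
}
print(out_of_range_features(sample))
```

With a pandas DataFrame, each row could be checked by iterating over `df.to_dict("records")`.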

## Related Files
- `primogpt/prepare_data.py`: Handles data collection and preprocessing
- `primogpt/create_prompt.py`: Manages prompt engineering and GPT interactions

### Note
I manually moved the generated data from the temporary folder to `data_for_train_primogpt` for better organization.
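
The manual move could also be scripted. This is only a sketch under assumptions: `move_generated_csvs` is a hypothetical helper, and the source and destination paths in the usage comment are placeholders that would need to match the actual folder layout.

```python
import shutil
from pathlib import Path


def move_generated_csvs(src_dir, dest_dir):
    """Move every CSV file from src_dir into dest_dir and return the new paths."""
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    moved = []
    for csv_path in sorted(src.glob("*.csv")):
        target = dest / csv_path.name
        shutil.move(str(csv_path), str(target))
        moved.append(target)
    return moved


# Hypothetical usage; both paths are assumptions:
# move_generated_csvs("data/AMD_2021-09-01_2024-07-31", "data_for_train_primogpt")
```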

In [1]:
# Import required modules and set up paths
import sys
sys.path.append('../../')

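import json
import os

# pandas is used below as `pd`; import it explicitly rather than relying on the star imports
import pandas as pd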

from primogpt.create_prompt import *
from primogpt.prepare_data import *



In [2]:
# Define stock symbol and date range for data collection
# This example uses a single stock symbol
stock_symbol = "AMD"
start_date = "2021-09-01"
end_date = "2024-07-31"

# Create directory for storing data
data_dir = f"data/{stock_symbol}_{start_date}_{end_date}"
os.makedirs(data_dir, exist_ok=True)

In [3]:
# Downloads and processes raw stock data
prepare_data_for_symbol(stock_symbol, data_dir, start_date, end_date)

[*********************100%***********************]  1 of 1 completed


News done
Press releases done


Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News,PressReleases
1,2021-09-02,109.199997,-0.007182,D1,[],[]
2,2021-09-03,109.919998,0.006593,U1,[],[]
3,2021-09-07,109.150002,-0.007005,D1,[],[]
4,2021-09-08,106.169998,-0.027302,D3,[],[]
5,2021-09-09,106.150002,-0.000188,D1,[],[]
...,...,...,...,...,...,...
726,2024-07-24,144.630005,-0.060844,D5+,"[{""date"": ""20240724160029"", ""headline"": ""Why b...",[]
727,2024-07-25,138.320007,-0.043629,D5,"[{""date"": ""20240725160001"", ""headline"": ""AMD I...",[]
728,2024-07-26,139.990005,0.012073,U2,"[{""date"": ""20240726160000"", ""headline"": ""Forge...",[]
729,2024-07-29,139.750000,-0.001714,D1,"[{""date"": ""20240729181730"", ""headline"": ""Analy...",[]
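
The `Returns` and `Bin Label` columns in the output above look consistent with a simple day-over-day return and a label made of the direction (`U` or `D`) plus the absolute move rounded up to the next whole percent, with `5+` for moves above 5%. The sketch below reproduces that pattern. It is inferred from the sample rows only; `simple_return` and `bin_label` are hypothetical helpers and may differ from what `primogpt/prepare_data.py` actually does.

```python
import math


def simple_return(prev_close, close):
    """Day-over-day simple return between two closing prices."""
    return close / prev_close - 1


def bin_label(ret):
    """Map a return to a label such as 'U1', 'D3', or 'D5+'.

    Assumption: the magnitude is the absolute percentage move rounded up
    to the next whole percent, with anything above 5% labeled '5+'.
    """
    direction = "U" if ret >= 0 else "D"
    pct = abs(ret) * 100
    if pct > 5:
        return f"{direction}5+"
    return f"{direction}{max(1, math.ceil(pct))}"


# Two consecutive closes taken from the table above (2021-09-02 and 2021-09-03).
ret = simple_return(109.199997, 109.919998)
print(round(ret, 6), bin_label(ret))
```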


In [5]:
# Build the path to the raw-data CSV written in the previous step
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)

# Load and display raw data for verification
df = pd.read_csv(csv_file_path)
df.tail(50)

Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News,PressReleases
680,2024-05-17,164.470001,0.011376,U2,"[{""date"": ""20240517160002"", ""headline"": ""Intel...",[]
681,2024-05-20,166.330002,0.011309,U2,"[{""date"": ""20240520164500"", ""headline"": ""Fed T...",[]
682,2024-05-21,164.660004,-0.01004,D2,"[{""date"": ""20240521173000"", ""headline"": ""AMD I...","[{""date"": ""2024-05-21 09:00:00"", ""headline"": ""..."
683,2024-05-22,165.520004,0.005223,U1,"[{""date"": ""20240522172500"", ""headline"": ""AMD E...",[]
684,2024-05-23,160.429993,-0.030752,D4,"[{""date"": ""20240523193622"", ""headline"": ""Nvidi...",[]
685,2024-05-24,166.360001,0.036963,U4,"[{""date"": ""20240524182514"", ""headline"": ""AMD C...",[]
686,2024-05-28,171.610001,0.031558,U4,"[{""date"": ""20240528163000"", ""headline"": ""Predi...","[{""date"": ""2024-05-28 17:22:00"", ""headline"": ""..."
687,2024-05-29,165.139999,-0.037702,D4,"[{""date"": ""20240529160012"", ""headline"": ""Is AM...",[]
688,2024-05-30,166.75,0.009749,U1,"[{""date"": ""20240530173102"", ""headline"": ""Advan...",[]
689,2024-05-31,166.899994,0.0009,U1,"[{""date"": ""20240531173146"", ""headline"": ""Why I...","[{""date"": ""2024-06-02 22:59:00"", ""headline"": ""..."


In [9]:
# Display sample news content to verify data quality
news_content = df.loc[155, 'News']
news_content_json = json.loads(news_content) 

print("News:")
for news_item in news_content_json:
    print(f"Date: {news_item['date']}, Headline: {news_item['headline']}, Summary: {news_item['summary']}\n")

News:
Date: 20220414183600, Headline: Advanced Micro Devices Inc. stock underperforms Thursday when compared to competitors, Summary: Shares of Advanced Micro Devices Inc. slipped 4.79% to $93.06 Thursday, on what proved to be an all-around grim trading session for the stock market, with...

Date: 20220414183800, Headline: Here’s why retail investors will come back to crypto, despite Fed rate hikes, says chief executive at eToro, Summary: A weekly look at the most important moves and news in crypto and what's on the horizon in digital assets.

Date: 20220414184727, Headline: Top Millennial, Gen Z Stock Picks Shift To Energy Stocks; But Tesla Still No. 1, Summary: Millennials and Gen Z are investing earlier than previous generations. What are their top stock picks? And should you buy them too?

Date: 20220414213033, Headline: Samsung Overtook Intel As Top Chip Seller In 2021 Thanks To Automotive, Smartphones, Summary: The global semiconductor revenue reached $595 billion in 2021, up 26.

In [10]:
# Display sample press releases to verify data quality
press_releases = df.loc[155, 'PressReleases']
press_releases_json = json.loads(press_releases)

print("Press Releases:")
for release in press_releases_json:
    print(f"Date: {release['date']}, Headline: {release['headline']}, Description: {release['description']}\n")

Press Releases:
Date: 2022-04-14 16:15:00, Headline: AMD to Report Fiscal First Quarter 2022 Financial Results, Description: Management will conduct a conference call to discuss these results at 5:00 p.m. EST/ 2:00 p.m. PST. AMD, the AMD Arrow logo and the combination thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners.. Contact Drew Prairie AMD Communications 512-602-4425...
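
The two outputs above use different timestamp layouts: news items carry compact values such as `20220414183600`, while press releases use `2022-04-14 16:15:00`. If the two sources need to be compared or sorted together, a small normalizer can help. This is a sketch that assumes only these two layouts occur; `parse_timestamp` is a hypothetical helper, not part of the `primogpt` package.

```python
from datetime import datetime

# Assumption: only these two layouts appear in the collected data.
TIMESTAMP_FORMATS = ("%Y%m%d%H%M%S", "%Y-%m-%d %H:%M:%S")


def parse_timestamp(value):
    """Parse a news or press-release timestamp string into a datetime."""
    for fmt in TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next known layout
    raise ValueError(f"Unrecognized timestamp format: {value!r}")


print(parse_timestamp("20220414183600"))
print(parse_timestamp("2022-04-14 16:15:00"))
```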



In [11]:
# Generates features using GPT-4o and saves results to CSV
results = process_stock_data(stock_symbol, data_dir, start_date, end_date, is_for_train=True, custom_gpt=False)

100%|██████████| 729/729 [17:56<00:00,  1.48s/it]


In [12]:
# Load and display processed data with GPT-generated features
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}_gpt.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)
df = pd.read_csv(csv_file_path)
df.tail(50)

Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News Relevance,Sentiment,Price Impact Potential,Trend Direction,Earnings Impact,Investor Confidence,Risk Profile Change,Prompt
679,2024-05-16,162.619995,0.018476,U2,2,1,1,1,1,2,0,Human: \n [SYSTEM PROMPT]\n You are a se...
680,2024-05-17,164.470001,0.011376,U2,2,1,2,1,1,2,0,Human: \n [SYSTEM PROMPT]\n You are a se...
681,2024-05-20,166.330002,0.011309,U2,1,0,-1,-1,0,-1,0,Human: \n [SYSTEM PROMPT]\n You are a se...
682,2024-05-21,164.660004,-0.01004,D2,2,1,1,1,1,1,0,Human: \n [SYSTEM PROMPT]\n You are a se...
683,2024-05-22,165.520004,0.005223,U1,1,0,-2,-1,0,-1,0,Human: \n [SYSTEM PROMPT]\n You are a se...
684,2024-05-23,160.429993,-0.030752,D4,1,0,0,0,0,0,0,Human: \n [SYSTEM PROMPT]\n You are a se...
685,2024-05-24,166.360001,0.036963,U4,2,1,2,1,2,2,0,Human: \n [SYSTEM PROMPT]\n You are a se...
686,2024-05-28,171.610001,0.031558,U4,2,0,-2,-1,1,0,0,Human: \n [SYSTEM PROMPT]\n You are a se...
687,2024-05-29,165.139999,-0.037702,D4,1,0,1,1,1,1,0,Human: \n [SYSTEM PROMPT]\n You are a se...
688,2024-05-30,166.75,0.009749,U1,2,0,0,1,0,0,-1,Human: \n [SYSTEM PROMPT]\n You are a se...
