# Stock Data Download and Preparation

This notebook downloads and prepares the data used for stock analysis. It:
1. Downloads historical stock data from Yahoo Finance
2. Retrieves news and press releases
3. Prepares and formats the data for further processing

## Key Parameters
- `stock_symbol`: Target stock symbol (e.g., "CRM")
- `start_date`: Historical data start date
- `end_date`: Historical data end date
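
As a minimal sketch, the three parameters can be checked before any download starts. The helper name `build_data_dir` is hypothetical and not part of `primogpt`. It assumes ISO `YYYY-MM-DD` date strings, as used in the cells of this notebook, and follows the same `data/{stock_symbol}_{start_date}_{end_date}` directory naming.

```python
from datetime import date


def build_data_dir(stock_symbol: str, start_date: str, end_date: str) -> str:
    """Validate the parameters and return the output directory path.

    Hypothetical helper for illustration; not part of primogpt.
    """
    if not stock_symbol:
        raise ValueError("stock_symbol must not be empty")
    # date.fromisoformat raises ValueError on malformed dates
    start = date.fromisoformat(start_date)
    end = date.fromisoformat(end_date)
    if start >= end:
        raise ValueError("start_date must be earlier than end_date")
    return f"data/{stock_symbol}_{start_date}_{end_date}"


print(build_data_dir("AMZN", "2022-04-01", "2025-02-28"))
```

Failing early on a malformed or reversed date range avoids creating a misnamed output directory.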

## Data Collection Process
1. Downloads historical stock data:
   - Daily price data (Open, High, Low, Close, Volume)
   - Adjusted closing prices
2. Retrieves associated news and press releases:
   - Financial news articles
   - Company press releases

## Output Format
Data is saved as a CSV file with the following columns:
- `Date`
- `Adj Close Price`, `Returns`, `Bin Label`: stock price data
- `News`: JSON list of news articles (date, headline, summary)
- `PressReleases`: JSON list of press releases (date, headline, description)
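
The return and bin-label values can be illustrated with a short sketch. The rule below is an assumption inferred from the sample rows shown later in this notebook (for example, a return of 0.029264 is labelled `U3` and -0.032300 is labelled `D4`); it is not taken from the `primogpt` source, and the cap at 5 is a guess. The function names are hypothetical.

```python
import math


def simple_returns(prices):
    """Day-over-day simple returns: p[t] / p[t-1] - 1."""
    return [curr / prev - 1 for prev, curr in zip(prices, prices[1:])]


def bin_label(ret, max_bin=5):
    """Map a return to a label such as 'U3' or 'D1'.

    Assumed rule (inferred from sample rows, not from primogpt's source):
    direction U/D plus the absolute return in percent, rounded up,
    capped at max_bin (the cap itself is an assumption).
    """
    direction = "U" if ret >= 0 else "D"
    magnitude = max(1, math.ceil(abs(ret) * 100))
    return f"{direction}{min(magnitude, max_bin)}"


# Adjusted close values for 2022-04-04 to 2022-04-06 from the sample output
prices = [168.346497, 164.054993, 158.755997]
for r in simple_returns(prices):
    print(round(r, 6), bin_label(r))
```

Returns that land exactly on a percent boundary may fall into either neighbouring bin because of floating-point rounding, so the real pipeline may treat those edges differently.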

## Usage
This data will be used as input for the PrimoGPT model to generate NLP features in subsequent analysis steps.

### Note
I manually moved the generated data from the temporary folder to `data_for_train_primogpt` for better organization.

### The cell below is for package installation on Google Colab

In [1]:
# Import required modules and set up paths
import sys
sys.path.append('../../')

import json
import os

import pandas as pd  # pd.read_csv is used when loading the generated CSV

from primogpt.create_prompt import *
from primogpt.prepare_data import *



In [2]:
# Define stock symbol and date range
stock_symbol = "AMZN"
start_date = "2022-04-01"
end_date = "2025-02-28"

# Create directory for storing generated features
data_dir = f"data/{stock_symbol}_{start_date}_{end_date}"
os.makedirs(data_dir, exist_ok=True)

In [3]:
# Download and prepare raw data
# This includes stock prices, news, and press releases
prepare_data_for_symbol(stock_symbol, data_dir, start_date, end_date)

[*********************100%***********************]  1 of 1 completed


Returns done
Skipping news item with invalid timestamp: -62135596800
Skipping news item with invalid timestamp: -62135596800
News done
Press releases done


Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News,PressReleases
1,2022-04-04,168.346497,0.029264,U3,[],"[{""date"": ""2022-04-04 01:07:00"", ""headline"": ""..."
2,2022-04-05,164.054993,-0.025492,D3,[],"[{""date"": ""2022-04-05 07:00:00"", ""headline"": ""..."
3,2022-04-06,158.755997,-0.032300,D4,[],"[{""date"": ""2022-04-06 06:01:00"", ""headline"": ""..."
4,2022-04-07,157.784500,-0.006119,D1,[],"[{""date"": ""2022-04-07 10:00:00"", ""headline"": ""..."
5,2022-04-08,154.460495,-0.021067,D3,"[{""date"": ""20220410041718"", ""headline"": ""Elon ...",[]
...,...,...,...,...,...,...
724,2025-02-21,216.580002,-0.028266,D3,"[{""date"": ""20250221164515"", ""headline"": ""Super...",[]
725,2025-02-24,212.710007,-0.017869,D2,"[{""date"": ""20250224183840"", ""headline"": ""Selli...",[]
726,2025-02-25,212.800003,0.000423,U1,"[{""date"": ""20250225202117"", ""headline"": ""Alpha...",[]
727,2025-02-26,214.350006,0.007284,U1,"[{""date"": ""20250226162551"", ""headline"": ""ACCO ...",[]
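
The skipped timestamp `-62135596800` in the log above is the Unix-epoch offset of 0001-01-01 00:00:00, which usually points to an unset date in the upstream feed rather than a real publication time. A minimal sketch of such a filter follows; `is_valid_timestamp` and its 1990 lower bound are assumptions for illustration, not the check that `primogpt` actually performs.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
SENTINEL = -62135596800  # value reported in the log

# Adding the offset to the epoch avoids platform limits of fromtimestamp()
print(EPOCH + timedelta(seconds=SENTINEL))


def is_valid_timestamp(ts, earliest=datetime(1990, 1, 1)):
    """Return True if ts (seconds since the epoch) is on or after `earliest`.

    The 1990 lower bound is an arbitrary assumption for this sketch.
    """
    try:
        return EPOCH + timedelta(seconds=ts) >= earliest
    except OverflowError:
        # Offsets outside the range datetime can represent
        return False


timestamps = [SENTINEL, 1649045220]
print([ts for ts in timestamps if is_valid_timestamp(ts)])
```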


In [4]:
# Load the raw data file
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)

# Display the first 50 rows of the data
df = pd.read_csv(csv_file_path)
df.head(50)

Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News,PressReleases
0,2022-04-04,168.346497,0.029264,U3,[],"[{""date"": ""2022-04-04 01:07:00"", ""headline"": ""..."
1,2022-04-05,164.054993,-0.025492,D3,[],"[{""date"": ""2022-04-05 07:00:00"", ""headline"": ""..."
2,2022-04-06,158.755997,-0.0323,D4,[],"[{""date"": ""2022-04-06 06:01:00"", ""headline"": ""..."
3,2022-04-07,157.7845,-0.006119,D1,[],"[{""date"": ""2022-04-07 10:00:00"", ""headline"": ""..."
4,2022-04-08,154.460495,-0.021067,D3,"[{""date"": ""20220410041718"", ""headline"": ""Elon ...",[]
5,2022-04-11,151.121994,-0.021614,D3,"[{""date"": ""20220411161916"", ""headline"": ""Marke...",[]
6,2022-04-12,150.787506,-0.002213,D1,"[{""date"": ""20220412160023"", ""headline"": ""Major...","[{""date"": ""2022-04-12 16:48:00"", ""headline"": ""..."
7,2022-04-13,155.541,0.031524,U4,"[{""date"": ""20220413164844"", ""headline"": ""Amazo...","[{""date"": ""2022-04-13 10:00:00"", ""headline"": ""..."
8,2022-04-14,151.706497,-0.024653,D3,"[{""date"": ""20220414160000"", ""headline"": ""CNH I...","[{""date"": ""2022-04-14 16:01:00"", ""headline"": ""..."
9,2022-04-18,152.785004,0.007109,U1,"[{""date"": ""20220418160102"", ""headline"": ""Alpha...",[]
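
The `News` and `PressReleases` cells hold JSON-encoded lists, and the doubled quotes (`""`) in the rows above appear to be ordinary CSV escaping of the quotes inside each JSON string. A minimal standard-library sketch of decoding both columns for every row follows; the sample rows are shortened and the headline text `Sample` is a placeholder, not real data.

```python
import csv
import io
import json

# Two-row sample in the same layout as the generated CSV (values shortened)
raw = '''Date,News,PressReleases
2022-04-04,[],"[{""date"": ""2022-04-04 01:07:00"", ""headline"": ""Sample""}]"
2022-04-08,"[{""date"": ""20220410041718"", ""headline"": ""Sample""}]",[]
'''

for row in csv.DictReader(io.StringIO(raw)):
    news = json.loads(row["News"])                # list of dicts, possibly empty
    releases = json.loads(row["PressReleases"])   # list of dicts, possibly empty
    print(row["Date"], len(news), len(releases))
```

With pandas, the same decoding can be applied per column, for example with `df["News"].apply(json.loads)`.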


In [5]:
# Print the news items stored for one sample row (index 155) to verify content
news_content = df.loc[155, 'News']
news_content_json = json.loads(news_content)

print("News:")
for news_item in news_content_json:
    print(f"Date: {news_item['date']}, Headline: {news_item['headline']}, Summary: {news_item['summary']}\n")

News:
Date: 20221114163143, Headline: EXG: Great Price But Portfolio Needs Fixes, Summary: EXG invests in a portfolio of global stocks and uses an options strategy in an attempt to generate a high level of income for its investors.

Date: 20221114190053, Headline: Amazon: Time To Buy The Dip, Summary: After two years of being in a bull market, Amazon's stock price has eventually fallen to pre-covid levels. Click here to read my analysis of AMZN stock.

Date: 20221114191257, Headline: Amazon 'primed' to lay off thousands of workers this week, Summary: Amazon announces it plans to lay of thousands of employees this week. With CNBC's Melissa Lee and the Fast Money traders, Tim Seymour, Guy Adami, Karen Finerman and Dan Nathan.

Date: 20221114191758, Headline: Xi-Biden meeting was progress, but will not change competition narrative: Longview Global's McNeal, Summary: Dewardric McNeal of Longview Global on what Biden's meeting with Xi means for U.S.-China relations. With CNBC's Melissa Lee 

In [6]:
# Print the press releases stored for one sample row (index 156) to verify content
press_releases = df.loc[156, 'PressReleases']
press_releases_json = json.loads(press_releases)

print("Press Releases:")
for release in press_releases_json:
    print(f"Date: {release['date']}, Headline: {release['headline']}, Description: {release['description']}\n")

Press Releases:
Date: 2022-11-15 09:00:00, Headline: Amazon Books Editors Announce 2022’s Best Books of the Year, Description: Today, the Amazon Books Editors announced their selections for the Best Books of 2022, naming Gabrielle Zevin’ s novel Tomorrow, and Tomorrow, and Tomorrow as the Best Book of the Year. The annual list is hand-picked by a team of editors who read thousands of books each year and share their recommendations on Amazon Book Review to help customers find their next...
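
The two samples above use different date formats: news items carry compact `YYYYMMDDHHMMSS` strings, while press releases use `YYYY-MM-DD HH:MM:SS`. If later steps need to compare or sort items from both sources, a small normalizer helps. The sketch below is an assumption about how that could be done; `parse_item_date` is a hypothetical helper, not part of `primogpt`.

```python
from datetime import datetime

# The two formats observed in the sample output
FORMATS = ("%Y%m%d%H%M%S", "%Y-%m-%d %H:%M:%S")


def parse_item_date(value):
    """Parse a news or press-release date string into a datetime."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next known format
    raise ValueError(f"Unrecognized date format: {value!r}")


print(parse_item_date("20221114163143"))
print(parse_item_date("2022-11-15 09:00:00"))
```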

