# Stock Data Download and Preparation

This notebook downloads and prepares the data used for stock analysis. It:
1. Downloads historical stock data from Yahoo Finance
2. Retrieves news and press releases
3. Prepares and formats the data for further processing

## Key Parameters
- `stock_symbol`: Target stock symbol (e.g., "CRM"; the run below uses "NFLX")
- `start_date`: Historical data start date
- `end_date`: Historical data end date

## Data Collection Process
1. Downloads historical stock data:
   - Daily price data (Open, High, Low, Close, Volume)
   - Adjusted closing prices
   - Trading volume
2. Retrieves associated news and press releases:
   - Financial news articles
   - Company press releases

## Output Format
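Data is saved in CSV format with the following columns:
- `Date`: trading date
- `Adj Close Price`, `Returns`, `Bin Label`: stock price data
- `News`: JSON-encoded list of news articles (`date`, `headline`, `summary`)
- `PressReleases`: JSON-encoded list of press releases (`date`, `headline`, `description`)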
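
The `Returns` and `Bin Label` columns can be illustrated with a short sketch. The binning rule here is an assumption inferred from the sample rows in this notebook (for example, 0.048277 maps to `U5` and -0.031040 maps to `D4`): a direction letter followed by the absolute return in percent, rounded up to the next whole percent. The `5+` cap for moves above 5% and the function names are hypothetical; the actual logic in `primogpt.prepare_data` may differ.

```python
import math


def daily_returns(prices):
    """Simple returns r_t = p_t / p_(t-1) - 1 for consecutive prices."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def bin_label(ret, max_bin=5):
    """Map a return to a label such as 'U2' or 'D4'.

    Assumed rule (inferred from sample output, not taken from primogpt):
    'U' for non-negative returns, 'D' for negative ones, followed by the
    absolute return in percent rounded up to a whole number.
    """
    direction = "U" if ret >= 0 else "D"
    size = max(1, math.ceil(abs(ret) * 100))
    if size > max_bin:
        # Assumption: larger moves share one open-ended bin.
        return f"{direction}{max_bin}+"
    return f"{direction}{size}"


# Two consecutive adjusted closes from the sample output
returns = daily_returns([391.5, 380.149994])
labels = [bin_label(r) for r in returns]
```

Under this assumed rule, the second sample row (a return of about -2.9%) falls in the `D3` bin, which matches the label shown in the output table.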

## Usage
This data will be used as input for the PrimoGPT model to generate NLP features in subsequent analysis steps.

### Note
I manually moved the generated data from the temporary folder to `data_for_train_primogpt` for better organization.

### The cell below is for package installation on Google Colab

In [10]:
# Import required modules and set up paths
import sys
sys.path.append('../../')

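import json
import os

# pandas is used later as `pd`; import it explicitly instead of relying on the wildcard imports
import pandas as pd

from primogpt.create_prompt import *
from primogpt.prepare_data import *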

In [11]:
# Define stock symbol and date range
stock_symbol = "NFLX"
start_date = "2022-04-01"
end_date = "2025-02-28"

# Create directory for storing the downloaded data
data_dir = f"data/{stock_symbol}_{start_date}_{end_date}"
os.makedirs(data_dir, exist_ok=True)

In [12]:
# Download and prepare raw data
# This includes stock prices, news, and press releases
prepare_data_for_symbol(stock_symbol, data_dir, start_date, end_date)

[*********************100%***********************]  1 of 1 completed


Returns done
News done
Press releases done


Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News,PressReleases
1,2022-04-04,391.500000,0.048277,U5,[],[]
2,2022-04-05,380.149994,-0.028991,D3,[],[]
3,2022-04-06,368.350006,-0.031040,D4,[],[]
4,2022-04-07,362.149994,-0.016832,D2,[],[]
5,2022-04-08,355.880005,-0.017313,D2,"[{""date"": ""20220410084408"", ""headline"": ""Tradi...",[]
...,...,...,...,...,...,...
724,2025-02-21,1003.150024,-0.020878,D3,"[{""date"": ""20250221190600"", ""headline"": ""Comca...",[]
725,2025-02-24,988.469971,-0.014634,D2,[],[]
726,2025-02-25,977.239990,-0.011361,D2,[],[]
727,2025-02-26,990.059998,0.013119,U2,"[{""date"": ""20250227090647"", ""headline"": ""Netfl...","[{""date"": ""2025-02-26 12:00:00"", ""headline"": ""..."


In [13]:
# Load the raw data file
csv_file_name = f"{stock_symbol}_{start_date}_{end_date}.csv"
csv_file_path = os.path.join(data_dir, csv_file_name)

# Display the first 50 rows of the data
df = pd.read_csv(csv_file_path)
df.head(50)

Unnamed: 0,Date,Adj Close Price,Returns,Bin Label,News,PressReleases
0,2022-04-04,391.5,0.048277,U5,[],[]
1,2022-04-05,380.149994,-0.028991,D3,[],[]
2,2022-04-06,368.350006,-0.03104,D4,[],[]
3,2022-04-07,362.149994,-0.016832,D2,[],[]
4,2022-04-08,355.880005,-0.017313,D2,"[{""date"": ""20220410084408"", ""headline"": ""Tradi...",[]
5,2022-04-11,348.0,-0.022142,D3,"[{""date"": ""20220411174500"", ""headline"": ""3 Top...",[]
6,2022-04-12,344.100006,-0.011207,D2,"[{""date"": ""20220412162300"", ""headline"": ""All E...",[]
7,2022-04-13,350.429993,0.018396,U2,"[{""date"": ""20220413170735"", ""headline"": ""Netfl...",[]
8,2022-04-14,341.130005,-0.026539,D3,"[{""date"": ""20220414164602"", ""headline"": ""Netfl...",[]
9,2022-04-18,337.859985,-0.009586,D1,"[{""date"": ""20220418160028"", ""headline"": ""5 Rea...",[]


In [14]:
# Display the news items of one sample row to verify content
news_content = df.loc[155, 'News']
news_content_json = json.loads(news_content)

print("News:")
for news_item in news_content_json:
    print(f"Date: {news_item['date']}, Headline: {news_item['headline']}, Summary: {news_item['summary']}\n")

News:
Date: 20221114162510, Headline: Is Netflix (NFLX) Still an Attractive Investment Avenue?, Summary: Investment management company LVS Advisory, a New York City-based full-service investment firm, recently released its third-quarter 2022 investor letter. A copy of the same can be downloaded here. The defensive portfolio of the fund gained 2.9% in the quarter. Year-to-date, the portfolio gained 3.3% compared to an 18.5% decline for the Barclays High-Yield Bond Index. In […]

Date: 20221114163200, Headline: How these 22 tech stocks stand out from the pack this earnings season, Summary: Companies including Electronic Arts, ON Semiconductor and Fiserv pulled off a difficult feat in a down cycle --- improving sales and profit margins.

Date: 20221114171500, Headline: The Dow is outperforming, which could be a sign that the latest stock-market rally will flame out, Summary: While investors seek shelter in more defensively oriented equity names, one market technician sees the Dow's growin
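
Judging by the sample output in this notebook, news items use a compact `YYYYMMDDHHMMSS` timestamp while press releases use `YYYY-MM-DD HH:MM:SS`. A minimal sketch for normalizing both into `datetime` objects is shown below. The helper name `parse_item_date` is hypothetical, and the two formats are assumptions based only on the rows displayed here.

```python
from datetime import datetime

# Formats observed in the sample output: news first, then press releases.
KNOWN_FORMATS = ("%Y%m%d%H%M%S", "%Y-%m-%d %H:%M:%S")


def parse_item_date(value):
    """Parse an item's 'date' string, trying each known format in turn."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")


news_ts = parse_item_date("20221114162510")
release_ts = parse_item_date("2025-02-26 12:00:00")
```

Raising on an unknown format, rather than returning `None`, makes it easier to notice if the data source ever changes its timestamp layout.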

In [15]:
# Display the press releases of one sample row to verify content
press_releases = df.loc[156, 'PressReleases']
press_releases_json = json.loads(press_releases)

print("Press Releases:")
for release in press_releases_json:
    print(f"Date: {release['date']}, Headline: {release['headline']}, Description: {release['description']}\n")

Press Releases:
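
The empty output suggests that row 156 holds an empty press release list. One way to pick a more informative sample row is to look for cells whose JSON-encoded list is non-empty. This is a minimal sketch: the helper name `rows_with_items` and the sample cells are hypothetical and are not part of `primogpt`.

```python
import json


def rows_with_items(cells):
    """Return the positions of cells whose JSON-encoded list is non-empty."""
    return [i for i, cell in enumerate(cells) if json.loads(cell)]


# Hypothetical cells shaped like the PressReleases column
sample_cells = [
    "[]",
    '[{"date": "2025-02-26 12:00:00", "headline": "Example headline"}]',
    "[]",
]
non_empty = rows_with_items(sample_cells)
```

With the notebook's DataFrame, `rows_with_items(df['PressReleases'])` should list candidate positions for `df.loc[...]`, assuming the default integer index created by `pd.read_csv` and that every cell contains a JSON string.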
