# Stock and Macro Data Processing

## Overview
This script automates the retrieval, cleaning, and merging of stock and macroeconomic data using the `yfinance` library. It downloads daily prices for 20 large-cap U.S. stocks along with a set of macroeconomic indicators and combines them into a single dataset for analysis.

## Approach
1. **Download Stock Data**  
   - Retrieve historical price data for selected top stock tickers.
   - Clean missing values and rename columns to maintain consistency.

2. **Merge Stock Data**  
   - Combine individual stock datasets into a unified DataFrame.
   - Standardize column names and manage time indices efficiently.

3. **Download Macroeconomic Indicators**  
   - Collect financial indices, commodity prices, sector ETFs, and cryptocurrencies.
   - Handle missing values using forward and backward filling.

4. **Final Merge & Export**  
   - Merge macroeconomic indicators into the cleaned stock dataset.
   - Save the final DataFrame as a CSV file for further analysis.
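
The fill strategy from step 3 can be illustrated on a small made-up frame. This is only a sketch: the column names `indicator_a` and `indicator_b` and all values are hypothetical, chosen to show how dropping empty columns, forward filling, and backward filling interact.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration): one indicator with gaps, one with no data.
dates = pd.date_range("2024-01-01", periods=5, freq="D")
toy = pd.DataFrame(
    {
        "indicator_a": [np.nan, 10.0, np.nan, 12.0, np.nan],
        "indicator_b": [np.nan, np.nan, np.nan, np.nan, np.nan],
    },
    index=dates,
)

# Drop columns with no data at all, carry the last known value forward,
# then fill any remaining leading gap from the first available value.
filled = toy.dropna(axis=1, how="all").ffill().bfill()
print(filled)
```

Note that the first row of `indicator_a` is filled from a later date, so back-filled values should be treated with care in any time-ordered analysis.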

## Key Libraries Used
- `yfinance` for financial data retrieval
- `pandas` for data manipulation
- `matplotlib` & `seaborn` for visualization
- `missingno` for missing value diagnostics

The result is a structured dataset that combines stock performance with macroeconomic indicators for further analysis.
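
The outer merge from step 2 can be sketched with two made-up per-ticker frames. The tickers `AAA` and `BBB` and their prices are hypothetical; the point is only to show that an outer join keeps every date and leaves `NaN` where a ticker has no row.

```python
import pandas as pd

# Two hypothetical per-ticker frames with partially overlapping dates.
frame_a = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-02", "2024-01-03"]), "close_AAA": [10.0, 11.0]}
)
frame_b = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-03", "2024-01-04"]), "close_BBB": [20.0, 21.0]}
)

# Outer join on the date column, then use the sorted dates as the index.
merged = (
    pd.merge(frame_a, frame_b, on="date", how="outer")
    .sort_values("date")
    .set_index("date")
)
print(merged)
```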


# Financial Insights in Stock and Economic Analysis

## Stock Prices (`open`, `high`, `low`, `close`)
Stock prices represent the value at which a stock is bought or sold during a trading session. These prices help traders and analysts assess market trends, volatility, and investor sentiment.

- **Opening Price (`open`)**: The first price at which a stock is traded when the market opens. It reflects overnight investor sentiment and can indicate gaps from previous closing prices.
- **Highest Price (`high`)**: The highest price at which the stock traded during the session. It marks the most that buyers were willing to pay that day.
- **Lowest Price (`low`)**: The lowest price at which the stock traded during the session. It marks the least that sellers were willing to accept that day.
- **Closing Price (`close`)**: The final price when the market closes. This is often used for trend analysis and is a key metric in stock price movement.
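
As a rough illustration of how these four fields relate, the sketch below computes the overnight gap (today's open against the previous close) and the intraday range on a made-up frame. The ticker `XYZ` and all prices are hypothetical; only the `open_`/`high_`/`low_`/`close_` naming follows the dataset.

```python
import pandas as pd

# Hypothetical OHLC rows for an invented ticker XYZ.
ohlc = pd.DataFrame(
    {
        "open_XYZ": [100.0, 103.0, 101.0],
        "high_XYZ": [104.0, 105.0, 102.0],
        "low_XYZ": [99.0, 101.0, 98.0],
        "close_XYZ": [102.0, 102.0, 99.0],
    },
    index=pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-04"]),
)

# Overnight gap: relative change from the previous close to today's open.
overnight_gap = ohlc["open_XYZ"] / ohlc["close_XYZ"].shift(1) - 1

# Intraday range: distance between the session high and low.
intraday_range = ohlc["high_XYZ"] - ohlc["low_XYZ"]

print(pd.DataFrame({"overnight_gap": overnight_gap, "intraday_range": intraday_range}))
```

The first gap is `NaN` because there is no previous close to compare against.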

### Why These Prices Matter:
- Stock prices influence technical indicators such as moving averages, RSI, and Bollinger Bands.
- They help investors gauge momentum, trend direction, and possible entry or exit points for trades.
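
To make the link between closing prices and indicators concrete, here is a minimal sketch of Bollinger Bands built from a rolling mean and rolling standard deviation. The helper name `bollinger_bands` is hypothetical, the prices are synthetic random data, and the 20-day window with 2 standard deviations is the conventional setting rather than anything this notebook prescribes.

```python
import numpy as np
import pandas as pd


def bollinger_bands(close: pd.Series, window: int = 20, num_std: float = 2.0) -> pd.DataFrame:
    """Return the rolling mean and the bands num_std standard deviations around it."""
    middle = close.rolling(window).mean()
    std = close.rolling(window).std()  # pandas uses the sample standard deviation
    return pd.DataFrame(
        {"middle": middle, "upper": middle + num_std * std, "lower": middle - num_std * std}
    )


# Synthetic closing prices (random walk), used only for illustration.
rng = np.random.default_rng(0)
close = pd.Series(
    100 + rng.normal(0, 1, 60).cumsum(),
    index=pd.date_range("2024-01-01", periods=60, freq="B"),
    name="close_AAPL",
)

bands = bollinger_bands(close)
print(bands.tail())
```

The first `window - 1` rows are `NaN` because a full window of prices is not yet available.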

---

## Trading Volume (`volume`)
Trading volume represents the number of shares traded during a given time period. It acts as a measure of investor interest and liquidity in a stock.

- **High Volume:** Indicates strong investor participation, often accompanying price changes.
- **Low Volume:** Suggests weak investor interest, potentially leading to slower price movements or consolidation.

### Importance in Financial Markets:
- Volume confirms trends: A rising stock price with high volume suggests strong bullish sentiment.
- Sudden spikes in volume may indicate major events like earnings reports or mergers.
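
One simple way to flag the volume spikes mentioned above is to compare each session with the average of the preceding sessions. The helper `volume_spikes`, the 5-session window, the 2x threshold, and the volume figures are all assumptions made for this sketch.

```python
import pandas as pd


def volume_spikes(volume: pd.Series, window: int = 5, threshold: float = 2.0) -> pd.Series:
    """Flag sessions whose volume exceeds threshold times the prior-window average."""
    # shift(1) keeps the current session out of its own baseline.
    baseline = volume.shift(1).rolling(window).mean()
    return volume > threshold * baseline


# Hypothetical daily volumes for an invented ticker XYZ.
volume = pd.Series(
    [100, 110, 90, 105, 95, 400, 100],
    index=pd.date_range("2024-01-01", periods=7, freq="B"),
    name="volume_XYZ",
)

spikes = volume_spikes(volume)
print(volume[spikes])
```

Sessions without a full baseline window compare against `NaN` and are therefore never flagged.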

---

## Macro Indicators (e.g., `S&P500_Index`, `Gold_Futures`, `VIX_Index`)
Macroeconomic indicators track broader market and economic conditions, providing insights beyond individual stock movements.

- **Stock Market Indices (`S&P500_Index`, `NASDAQ_Composite`)**: Represent the performance of major companies and overall investor sentiment.
- **Commodity Prices (`Gold_Futures`, `WTI_Oil_Futures`)**: Reflect supply-demand dynamics and inflation trends.
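- **Volatility Index (`VIX_Index`)**: Derived from S&P 500 option prices, it measures the market's expected near-term volatility and is often read as a gauge of fear and uncertainty; a rising VIX indicates higher expected volatility.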

### How Macro Indicators Influence Investing:
- Investors use macro indicators to predict economic cycles and market trends.
- They help in asset allocation strategies, guiding whether to invest in stocks, bonds, or commodities.
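
A common first step when relating a stock to macro indicators is to correlate daily returns. The sketch below uses synthetic series that only borrow the column names `close_AAPL`, `S&P500_Index`, and `VIX_Index`; the data is generated so that the stock follows the index and the volatility series moves against it, so the resulting numbers say nothing about real markets.

```python
import numpy as np
import pandas as pd

# Synthetic data: a shared market factor plus independent noise per series.
rng = np.random.default_rng(42)
n = 250
market = rng.normal(0, 0.01, n)
frame = pd.DataFrame(
    {
        "S&P500_Index": 4000 * np.exp(np.cumsum(market)),
        "close_AAPL": 150 * np.exp(np.cumsum(market + rng.normal(0, 0.005, n))),
        "VIX_Index": 20 * np.exp(np.cumsum(-3 * market + rng.normal(0, 0.02, n))),
    },
    index=pd.date_range("2024-01-01", periods=n, freq="B"),
)

# Correlate daily percentage changes rather than price levels,
# since trending levels tend to produce spurious correlations.
returns = frame.pct_change().dropna()
corr = returns.corr()
print(corr["close_AAPL"].round(2))
```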

---

## Cryptocurrencies (`BTC-USD`, `ETH-USD`, etc.)
Cryptocurrencies represent digital assets operating on decentralized blockchain technology. Their values fluctuate based on market demand, regulation, and technological developments.

- **Bitcoin (`BTC-USD`)**: Often referred to as "digital gold," it serves as a store of value.
- **Ethereum (`ETH-USD`)**: A blockchain with smart contract capabilities, widely used in decentralized applications.
- **Volatility & Investor Sentiment**: Crypto markets tend to be more volatile than traditional assets, offering opportunities and risks.

### Role in Financial Markets:
- Some investors use cryptocurrencies for diversification or as a potential hedge against inflation, although their effectiveness in that role is debated.
- Growing institutional adoption can affect how crypto prices correlate with stock markets and how investment strategies are built.
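
The volatility comparison above can be quantified with annualized volatility, the standard deviation of daily log returns scaled by the square root of the number of periods per year. The helper `annualized_volatility` is hypothetical, both price series are synthetic, and the choice of 252 periods for stocks and 365 for crypto is an assumption based on their respective trading calendars.

```python
import numpy as np
import pandas as pd


def annualized_volatility(prices: pd.Series, periods_per_year: int) -> float:
    """Standard deviation of daily log returns, scaled to a yearly figure."""
    log_returns = np.log(prices / prices.shift(1)).dropna()
    return float(log_returns.std() * np.sqrt(periods_per_year))


# Synthetic series: the crypto-like one is generated with larger daily moves.
rng = np.random.default_rng(7)
n = 500
stock_like = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, n))))
crypto_like = pd.Series(30000 * np.exp(np.cumsum(rng.normal(0, 0.04, n))))

stock_vol = annualized_volatility(stock_like, periods_per_year=252)
crypto_vol = annualized_volatility(crypto_like, periods_per_year=365)
print(f"stock-like: {stock_vol:.2f}, crypto-like: {crypto_vol:.2f}")
```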

By understanding these financial insights, analysts and investors can make informed decisions regarding asset allocation, risk management, and trading strategies.


In [None]:
# --- Import Required Libraries ---
import os
import yfinance as yf
import pandas as pd
import holidays
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# --- Download Configuration ---
start_date = "2014-01-01"
end_date = "2025-05-06"  # yfinance treats `end` as exclusive, so the last row is 2025-05-05
# Define list of top stock tickers
tickers = [
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'META', 'NVDA', 'TSLA', 'JPM', 'V', 'JNJ',
    'WMT', 'PG', 'UNH', 'DIS', 'MA', 'HD', 'BAC', 'PFE', 'ADBE', 'PEP'
]
# List to collect all individual DataFrames
all_dfs = []
for ticker in tickers:
    try:
        #print(f"\n=== Downloading {ticker} ===")
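        df = yf.download(ticker, start=start_date, end=end_date, interval="1d", progress=True)
        # yfinance typically returns an empty frame (rather than raising) when a download fails
        if df.empty:
            print(f"[✗] No data returned for {ticker}; skipping.")
            continue
        # Recent yfinance versions may return (field, ticker) MultiIndex columns; keep the field level
        if isinstance(df.columns, pd.MultiIndex):
            df.columns = df.columns.get_level_values(0)
        print(f"{ticker} columns BEFORE reset:\n{df.columns.tolist()}")
        df.dropna(inplace=True)
        df.reset_index(inplace=True)
        print(f"{ticker} DataFrame shape AFTER dropna and reset_index: {df.shape}")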
        # Map column names to include ticker
        column_map = {
            'Open': f'open_{ticker}',
            'High': f'high_{ticker}',
            'Low': f'low_{ticker}',
            'Close': f'close_{ticker}',
            'Adj Close': f'adj_close_{ticker}',
            'Volume': f'volume_{ticker}'
        }
        # Only rename columns that exist
        existing_rename_map = {k: v for k, v in column_map.items() if k in df.columns}
        df.rename(columns=existing_rename_map, inplace=True)
        # Keep only Date and renamed columns
        columns_to_keep = ['Date'] + list(existing_rename_map.values())
        df = df[columns_to_keep].copy()  # copy so the later in-place rename does not act on a slice
        print(f"{ticker} final columns: {df.columns.tolist()}")
        all_dfs.append(df)
        print(f"[✓] {ticker} added to merge list.")
    except Exception as e:
        print(f"[✗] Error downloading {ticker}: {e}")
# Check if any data was downloaded
if not all_dfs:
    raise RuntimeError("No stock data was successfully downloaded.")
# === Merge all DataFrames ===
print("\n=== Merging all DataFrames ===")
merged_df = all_dfs[0].rename(columns={'Date': 'date'})
for i, df in enumerate(all_dfs[1:], start=1):
    df.rename(columns={'Date': 'date'}, inplace=True)
    print(f"Merging DF {i} with shape {df.shape}")
    merged_df = pd.merge(merged_df, df, on='date', how='outer')
    print(f"Merged DF shape now: {merged_df.shape}")
# Set 'date' as datetime index
print("\n=== Finalizing merged DataFrame ===")
merged_df['date'] = pd.to_datetime(merged_df['date'])
merged_df.set_index('date', inplace=True)
merged_df.sort_index(inplace=True)
# Clean up column names if MultiIndex
if isinstance(merged_df.columns, pd.MultiIndex):
    merged_df.columns = ['_'.join(filter(None, col)).strip() for col in merged_df.columns]
# Remove duplicated ticker names (e.g., open_AAPL_AAPL)
merged_df.columns = [col.replace('__', '_') for col in merged_df.columns]
merged_df.columns = ['_'.join(dict.fromkeys(col.split('_'))) if '_' in col else col for col in merged_df.columns]
merged_df.index.name = 'date'
clean_stock_data_df = merged_df.copy()
# Final debug print
print("\n✅ Final merged DataFrame info:")
print(f"Shape: {clean_stock_data_df.shape}")
print(f"Columns: {clean_stock_data_df.columns.tolist()[:10]} ...")
clean_stock_data_df.head(10)
# --- Define Macro Tickers and Labels ---
macro_tickers = {
    # Indices
    "^GSPC": "S&P500_Index", "^DJI": "Dow_Jones_Index", "^IXIC": "NASDAQ_Composite", "^RUT": "Russell2000_Index",
    "^VIX": "VIX_Index", "^FTSE": "FTSE100_Index", "^N225": "Nikkei225_Index", "^GDAXI": "DAX_Index",
    "^FCHI": "CAC40_Index", "^HSI": "HangSeng_Index", "000001.SS": "SSE_Composite_Index", "399001.SZ": "SZSE_Component_Index",
    # Commodities & Funds
    "DX-Y.NYB": "Dollar_Index_DXY", "GC=F": "Gold_Futures", "CL=F": "WTI_Oil_Futures", "HG=F": "Copper_Futures",
    "BZ=F": "Brent_Crude_Futures", "USO": "US_Oil_Fund", "UNG": "US_Natural_Gas_Fund",
    "GLD": "SPDR_Gold_Shares", "SLV": "iShares_Silver_Trust", "PPLT": "Platinum_Shares_ETF",
    "GDX": "Gold_Miners_ETF", "GDXJ": "Junior_Gold_Miners_ETF", "XB=F": "RBOB_Gasoline_Futures",
    # Sector ETFs
    "XLK": "Tech_Sector_ETF", "XLE": "Energy_Sector_ETF", "XLF": "Financial_Sector_ETF", "XLY": "ConsumerDiscretionary_ETF",
    "LIT": "Lithium_ETF", "SMH": "Semiconductor_ETF", "XLU": "Electricity_Proxy", "XLV": "Healthcare_Sector_ETF",
    "XLI": "Industrial_Sector_ETF", "XLB": "Materials_Sector_ETF", "XLP": "ConsumerStaples_ETF",
    "XLRE": "Real_Estate_SPDR", "IWM": "Russell2000_ETF", "QQQ": "Nasdaq100_ETF", "VWO": "Emerging_Markets_ETF",
    "BND": "Total_Bond_Market_ETF", "VNQ": "Real_Estate_Vanguard", "KIE": "SPDR_Insurance_ETF",
    "IGV": "Tech_Software_Sector_ETF", "ARKK": "ARK_Innovation_ETF",
    # Financial Indices & Treasury
    "BKX": "KBW_Bank_Index", "KRE": "Regional_Banking_ETF", "VFH": "Financials_ETF",
    "^TNX": "10Y_Treasury_Yield", "^FVX": "5Y_Treasury_Yield", "IEF": "7_10Y_Treasury_ETF",
    "SHY": "1_3Y_Treasury_ETF", "TLT": "20Y_Treasury_ETF", "IYR": "Real_Estate_iShares",
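    # Risk & Sentiment
    # Note: bare "VIX" and "PCE" do not appear to be the Yahoo Finance symbols for these series.
    # The VIX index is already covered by "^VIX" above, and Personal Consumption Expenditures
    # is a FRED series rather than a Yahoo Finance ticker, so both entries are disabled here.
    # "VIX": "Volatility_Index", "PCE": "Personal_Consumption_Expenditures",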
    # Cryptocurrencies
    "BTC-USD": "Bitcoin", "ETH-USD": "Ethereum", "XRP-USD": "XRP", "BNB-USD": "BNB", "SOL-USD": "Solana",
    "DOGE-USD": "Dogecoin", "ADA-USD": "Cardano", "TRX-USD": "TRON", "AVAX-USD": "Avalanche",
    "LINK-USD": "Chainlink", "XLM-USD": "Stellar", "UNI-USD": "Uniswap", "BCH-USD": "Bitcoin_Cash",
    "MATIC-USD": "Polygon", "LTC-USD": "Litecoin", "ATOM-USD": "Cosmos_Hub", "ETC-USD": "Ethereum_Classic",
    "XMR-USD": "Monero", "ALGO-USD": "Algorand", "VET-USD": "VeChain", "FIL-USD": "Filecoin",
    "ICP-USD": "Internet_Computer", "HBAR-USD": "Hedera", "NEAR-USD": "NEAR_Protocol",
    "AAVE-USD": "Aave", "EOS-USD": "EOS"
}
# --- Download Data ---
macro_df = pd.DataFrame()
for ticker, label in macro_tickers.items():
    print(f"Downloading: {label} ({ticker})")
    try:
        df = yf.download(ticker, start=start_date, end=end_date, progress=False)
        macro_df[label] = df['Close']
    except Exception as e:
        print(f"❌ Error downloading {label}: {e}")
# --- Data Cleaning ---
macro_df.dropna(axis=1, how='all', inplace=True)  # Drop fully missing columns
macro_df.ffill(inplace=True)                      # Forward fill (macroeconomic data is slower to update)
macro_df.bfill(inplace=True)                      # Backward fill leading gaps (copies the first later value backward, so early rows hold look-ahead data)
macro_df.index.name = "date"
# --- Diagnostics ---
print("✅ Finished loading macro data.")
print("🔍 Null values remaining:", macro_df.isnull().sum().sum())
# 📌 Merge macroeconomic indicators into the cleaned stock dataset
# Assumes both DataFrames use datetime as index
clean_stock_data_df = clean_stock_data_df.merge(
    macro_df, left_index=True, right_index=True, how="left"
)
# Save to CSV
# Assumes 'clean_stock_data_df' is the merged DataFrame with 'date' as index
path_stock = "../data/Stock_market_data"
os.makedirs(path_stock, exist_ok=True)  # create the output folder if it does not exist
clean_stock_data_df.to_csv(f"{path_stock}/clean_stock_data_with_time_index.csv", index=True, date_format='%Y-%m-%d')
print("✅ Saved merged stock and macro data to 'clean_stock_data_with_time_index.csv'")