# 01 - Data Ingestion: MSFT Stock Data

This notebook fetches historical stock data for Microsoft (MSFT) from Yahoo Finance and stores it in the Bronze layer of our Fabric Lakehouse, following the medallion architecture.

## Objectives:
- Fetch MSFT historical stock data from Yahoo Finance via yfinance
- Validate data quality and integrity
- Store raw data in the Bronze layer in Delta or Parquet format
- Generate data summary statistics

## Execution Schedule:
Daily at 4:30 PM ET (after market close)

In [8]:
import sys
from pathlib import Path
from datetime import datetime

sys.path.insert(0, str(Path.cwd().parent / 'utils'))

from data_loader import StockDataLoader, print_data_summary # type: ignore

print("✓ Libraries imported")
print(f"Execution: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

✓ Libraries imported
Execution: 2025-12-04 13:44:24


## Step 1: Initialize Data Loader

Configure the data loader with project settings from `config.json`, then set the ticker and date range for this run.

In [9]:
# Initialize data loader
config_path = Path.cwd().parent / 'config' / 'config.json'
loader = StockDataLoader(config_path=str(config_path))

# Configuration
TICKER = 'MSFT'
START_DATE = '2020-01-01'  # Fetch from 2020 for sufficient history
END_DATE = datetime.now().strftime('%Y-%m-%d')

print("Configuration loaded:")
print(f"  Ticker: {TICKER}")
print(f"  Date Range: {START_DATE} to {END_DATE}")
print(f"  Primary API: {loader.config['data_source']['primary_api']}")

Configuration loaded:
  Ticker: MSFT
  Date Range: 2020-01-01 to 2025-12-04
  Primary API: yfinance
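
The loader reads its settings from `config.json`, but the only key this notebook confirms is `data_source.primary_api`. As a minimal sketch of how such a file could be loaded and checked, the hypothetical helper below (`load_config` is not part of `StockDataLoader`) fails early when that key is missing:

```python
import json
from pathlib import Path


def load_config(config_path):
    """Read a JSON config file and check the one key this notebook relies on.

    Hypothetical helper for illustration; not the actual StockDataLoader code.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    try:
        # Only data_source.primary_api is known from the notebook output.
        config["data_source"]["primary_api"]
    except (KeyError, TypeError) as exc:
        raise ValueError("config.json must define data_source.primary_api") from exc
    return config
```

Failing at load time keeps a misconfigured run from reaching the fetch step with an undefined data source.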


## Step 2: Fetch Historical Stock Data

Download daily (`1d` interval) MSFT price history from Yahoo Finance for the configured date range.

In [10]:
# Fetch historical data
df = loader.fetch_stock_data(
    ticker=TICKER,
    start_date=START_DATE,
    end_date=END_DATE,
    interval='1d'
)

# Display first and last few rows
print("\nFirst 5 rows:")
print(df.head())
print("\nLast 5 rows:")
print(df.tail())
print(f"\nDataFrame shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

Fetching MSFT data from 2020-01-01 to 2025-12-04...
✓ Fetched 1489 records for MSFT

First 5 rows:
                       date        open        high         low       close  \
0 2020-01-02 00:00:00-05:00  150.758664  152.610150  150.331401  152.505707   
1 2020-01-03 00:00:00-05:00  150.321872  151.869516  150.074998  150.606705   
2 2020-01-06 00:00:00-05:00  149.144562  151.062519  148.603350  150.996048   
3 2020-01-07 00:00:00-05:00  151.271365  151.603675  149.372403  149.619263   
4 2020-01-08 00:00:00-05:00  150.901040  152.676579  149.970552  152.002441   

     volume  dividends  stock_splits Ticker             FetchTimestamp  year  \
0  22622100        0.0           0.0   MSFT 2025-12-04 13:44:25.424026  2020   
1  21116200        0.0           0.0   MSFT 2025-12-04 13:44:25.424026  2020   
2  20813700        0.0           0.0   MSFT 2025-12-04 13:44:25.424026  2020   
3  21634100        0.0           0.0   MSFT 2025-12-04 13:44:25.424026  2020   
4  27746500        0.0    
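
The column names in the output above (`open`, `stock_splits`, `Ticker`, `FetchTimestamp`, `year`) suggest that the loader renames the source columns and appends metadata. The sketch below shows one way to do that for a single row held in a plain dict. `normalize_record` is a hypothetical helper, and the renaming rule (lowercase, spaces to underscores) is an assumption inferred from the output, not the loader's confirmed logic.

```python
from datetime import datetime


def normalize_record(raw, ticker, fetched_at):
    """Convert one raw price row into the column layout shown above.

    Hypothetical helper: lowercases source column names, replaces spaces
    with underscores, and appends the metadata columns.
    """
    record = {key.strip().lower().replace(" ", "_"): value for key, value in raw.items()}
    record["Ticker"] = ticker
    record["FetchTimestamp"] = fetched_at
    record["year"] = record["date"].year
    return record


# Values taken from the first row of the output above.
raw_row = {
    "Date": datetime(2020, 1, 2),
    "Open": 150.758664,
    "High": 152.610150,
    "Low": 150.331401,
    "Close": 152.505707,
    "Volume": 22622100,
    "Dividends": 0.0,
    "Stock Splits": 0.0,
}
row = normalize_record(raw_row, "MSFT", datetime(2025, 12, 4, 13, 44, 25))
print(sorted(row))
```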

## Step 3: Data Quality Validation

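Run the loader's quality checks before writing anything to storage. `validate_data` returns a pass/fail flag and a list of issues to review.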

In [11]:
# Validate data
is_valid, issues = loader.validate_data(df)

if is_valid:
    print("✓ Data validation PASSED - Data is ready for storage")
else:
    print("✗ Data validation FAILED - Review issues before proceeding:")
    for issue in issues:
        print(f"  ! {issue}")

✓ Data validation passed
✓ Data validation PASSED - Data is ready for storage
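
The notebook does not show which checks `validate_data` performs. As an illustration of the kind of rules a daily price feed usually needs, here is a sketch that returns the same `(is_valid, issues)` shape. The function name `validate_rows` and every rule in it are assumptions, not the loader's actual checks.

```python
def validate_rows(rows):
    """Return (is_valid, issues) for a list of daily OHLCV dicts.

    Hypothetical checks; the real validate_data may test different things.
    """
    issues = []
    if not rows:
        return False, ["no records"]

    required = ("date", "open", "high", "low", "close", "volume")
    seen_dates = set()
    for i, row in enumerate(rows):
        missing = [name for name in required if row.get(name) is None]
        if missing:
            issues.append(f"row {i}: missing {', '.join(missing)}")
            continue
        if row["date"] in seen_dates:
            issues.append(f"row {i}: duplicate date {row['date']}")
        seen_dates.add(row["date"])
        if row["high"] < row["low"]:
            issues.append(f"row {i}: high is below low")
        if min(row["open"], row["high"], row["low"], row["close"]) <= 0:
            issues.append(f"row {i}: non-positive price")
        if row["volume"] < 0:
            issues.append(f"row {i}: negative volume")
    return not issues, issues


good = {"date": "2020-01-02", "open": 150.76, "high": 152.61,
        "low": 150.33, "close": 152.51, "volume": 22622100}
bad = dict(good, high=149.0)  # same date as `good`, and high below low
print(validate_rows([good]))
print(validate_rows([good, bad]))
```

Collecting every issue instead of stopping at the first one makes a failed run easier to diagnose in a single pass.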


## Step 4: Generate Data Summary

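Compute summary statistics for the fetched data: record count, date range, price range, latest close, average daily volume, and returns.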

In [12]:
# Generate and display summary
summary = loader.get_data_summary(df)
print_data_summary(summary)


DATA SUMMARY
Record Count:     1,489
Date Range:       2020-01-02 to 2025-12-03
Price Range:      $126.17 - $553.50
Latest Close:     $477.73
Avg Daily Volume: 27,640,162
Total Return:     213.25%
Avg Daily Return: 0.0942%
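
The reported total return is consistent with a simple return from the first close to the latest close. The sketch below reproduces that arithmetic using the first close from the Step 2 output (152.505707) and the latest close from the summary (477.73). The average daily return is assumed here to be the mean of day-over-day percentage changes. Both function names are hypothetical and may not match how `get_data_summary` computes its figures.

```python
def total_return(closes):
    """Simple return from the first close to the last close."""
    return closes[-1] / closes[0] - 1


def average_daily_return(closes):
    """Mean of day-over-day percentage changes (assumed definition)."""
    changes = [curr / prev - 1 for prev, curr in zip(closes, closes[1:])]
    return sum(changes) / len(changes)


# First close from the Step 2 output and the latest close from the summary.
print(f"{total_return([152.505707, 477.73]):.2%}")  # 213.25%
```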



## Step 5: Store Data in Bronze Layer

Save the raw data to the Bronze layer of the Lakehouse. The storage format depends on the environment:

**In Fabric**: Data will be stored as Delta table partitioned by year/month  
**Local Development**: Data will be stored as Parquet file

In [13]:
# Define lakehouse path
# In Fabric, this would be: /lakehouse/default/Files/
LAKEHOUSE_PATH = str(Path.cwd().parent / 'data')  # Local path for development

# Write to Bronze layer
bronze_path = loader.write_to_bronze(df, LAKEHOUSE_PATH)

print(f"\n{'='*60}")
print("DATA INGESTION COMPLETE")
print(f"{'='*60}")
print(f"Records stored: {len(df):,}")
print(f"Storage location: {Path(bronze_path).name}")
print("Ready for transformation pipeline")
print(f"{'='*60}")

Writing 1489 records to Bronze layer: bronze/stocks/stock_data
✓ Data written successfully to bronze/stocks/stock_data

DATA INGESTION COMPLETE
Records stored: 1,489
Storage location: stock_data
Ready for transformation pipeline
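
Partitioning by year/month is commonly laid out as Hive-style directories under the table path. The sketch below builds such a directory name for one trading day. The base path comes from the log output above; the `year=YYYY/month=MM` naming and the helper `partition_path` are assumptions, since the notebook does not show how `write_to_bronze` names its partitions.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(base, trade_date):
    """Build a Hive-style year/month partition directory for one trading day."""
    return PurePosixPath(base) / f"year={trade_date.year}" / f"month={trade_date.month:02d}"


print(partition_path("bronze/stocks/stock_data", date(2020, 1, 2)))
# bronze/stocks/stock_data/year=2020/month=01
```

Zero-padding the month keeps the directories in chronological order when they are listed alphabetically.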


## Next Steps

1. **Run Notebook 02**: Data transformation and cleaning
2. **Schedule**: Configure daily execution at 4:30 PM ET in Fabric pipeline
3. **Monitor**: Check data quality metrics regularly

---
**Note**: For incremental loads (daily updates), use the `fetch_latest_data()` method instead of a full historical fetch.
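
The internals of `fetch_latest_data()` are not shown in this notebook. As a rough sketch of what an incremental load generally involves, the hypothetical helpers below work out where the next fetch should start and merge new rows without duplicating dates. The names and the "incoming row wins" rule are assumptions.

```python
from datetime import date, timedelta


def next_fetch_start(stored_dates, default_start):
    """Return the first date an incremental load still needs to request."""
    if not stored_dates:
        return default_start
    return max(stored_dates) + timedelta(days=1)


def merge_new_rows(existing, incoming):
    """Append incoming rows, keeping one row per date (incoming wins)."""
    by_date = {row["date"]: row for row in existing}
    by_date.update({row["date"]: row for row in incoming})
    return [by_date[d] for d in sorted(by_date)]


stored = [date(2025, 12, 2), date(2025, 12, 3)]
print(next_fetch_start(stored, date(2020, 1, 1)))  # 2025-12-04
```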