# Financial Trading with Python, 2nd Edition
Cordell L. Tanny, CFA, FRM, FDP

## Chapter 3: Data Retrieval
### Notebook 3.2: Data Retrieval Lab

Version: 1

Date of last revision: January 18, 2026

This notebook covers the mechanics of downloading financial data from multiple sources and storing it locally. We will explore yfinance in depth, introduce Financial Modeling Prep (FMP) as a paid alternative, and show how to access economic data from FRED.

*Note and recommendation: Very often, you might want to experiment on your own as you go through this notebook. We recommend you save a copy of this notebook before you start adding cells or changing anything. This way you always have a pristine copy to go back to.*

---

## Setup

First, let's install and import the packages we need.

In [1]:
# Install packages if needed (uncomment if running in Colab)
# !pip install yfinance pandas-datareader --quiet

In [2]:
import yfinance as yf
import pandas as pd
import pandas_datareader.data as pdr
import numpy as np
import requests
import os
import warnings

# Display settings
pd.set_option('display.max_rows', 20)
pd.set_option('display.float_format', '{:.2f}'.format)
warnings.filterwarnings('ignore')

print(f"yfinance version: {yf.__version__}")
print(f"pandas version: {pd.__version__}")

yfinance version: 0.2.66
pandas version: 2.2.2


### Utility Function

Recent versions of yfinance return data with MultiIndex columns (e.g., `('Close', 'AAPL')` instead of just `Close`). This can cause errors when accessing columns directly or performing calculations. The following utility function flattens the MultiIndex to simple column names. We will use it throughout this notebook.

In [3]:
def flatten_columns(df):
    """Flatten MultiIndex columns from yfinance to single level."""
    if isinstance(df.columns, pd.MultiIndex):
        df.columns = df.columns.get_level_values(0)
    return df

---

## Section 1: Storing API Keys Securely

Before we start downloading data, we need to discuss how to handle API keys. Many data providers require an API key to authenticate your requests. How you store these keys matters.

### 1.1 Why Not Hardcode Keys

It is tempting to just paste your API key directly into your code:

```python
# DO NOT DO THIS
api_key = "abc123xyz789"
```

This is a bad idea for several reasons:

- **Security risk**: If you share your notebook or push it to GitHub, your key is exposed. Bots constantly scan public repositories for API keys.
- **Revocation headaches**: If your key is compromised and you need to revoke it, you have to find and update every notebook or script where you hardcoded it.
- **Collaboration problems**: If you share code with colleagues, they need to manually replace your key with theirs.

The solution is to store keys separately from your code.

### 1.2 Google Colab Secrets

Google Colab provides a built-in Secrets manager that keeps your API keys secure and separate from your code.

**To add a secret:**

1. Click the **key icon** in the left sidebar (above the folder icon)
2. Click **"Add new secret"**
3. Enter a name (e.g., `FMP_API_KEY`) and paste your key as the value
4. Toggle **"Notebook access"** to enable access from your code

**To retrieve a secret in your code:**

In [4]:
# This only works in Google Colab
# Uncomment if running in Colab and you have set up a secret

# from google.colab import userdata
# fmp_api_key = userdata.get('FMP_API_KEY')
# print(f"Key retrieved: {fmp_api_key[:4]}...{fmp_api_key[-4:]}")

The code above retrieves your key without ever displaying it in full or storing it in your notebook. If you share this notebook, the recipient will need to set up their own secret with the same name.

### 1.3 Environment Variables (Alternative)

If you are not using Colab, the standard approach is environment variables. You can set these in your terminal before running Python, or use a `.env` file with the `python-dotenv` package.

```python
import os
api_key = os.environ.get('FMP_API_KEY')
```

For local development, many developers use a `.env` file that is excluded from version control via `.gitignore`. This is beyond the scope of this notebook, but worth knowing about if you move beyond Colab.
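As a minimal sketch of the lookup order described in this section, the helper below tries Colab Secrets first and falls back to an environment variable. The function names `load_api_key` and `mask_key` are our own, not part of any library, and catching every exception around `userdata.get` is a simplifying assumption.

```python
import os


def load_api_key(name):
    """Return the key stored under `name`, or raise if it cannot be found."""
    key = None
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        key = userdata.get(name)
    except Exception:
        # Not in Colab, secret missing, or notebook access not granted
        key = None
    if not key:
        key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} not found in Colab Secrets or environment variables")
    return key


def mask_key(key):
    """Show only the first and last four characters of a key."""
    if len(key) <= 8:
        return "*" * len(key)
    return f"{key[:4]}...{key[-4:]}"
```

With this in place, `mask_key(load_api_key('FMP_API_KEY'))` confirms a key was found without printing it in full.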

---

## Section 2: yfinance Deep Dive

yfinance is a free, open-source library that downloads data from Yahoo Finance. It is our primary data source for this book. Let's explore its features in detail.

### 2.1 Basic Download Syntax

The main function is `yf.download()`. At minimum, you need to specify a ticker symbol.

In [5]:
# Download five years of Microsoft data (2020 through 2024)
msft = yf.download('MSFT', start='2020-01-01', end='2024-12-31', progress=False)

print(f"Data shape: {msft.shape}")
print(f"Date range: {msft.index[0].strftime('%Y-%m-%d')} to {msft.index[-1].strftime('%Y-%m-%d')}")
print(f"\nColumns: {list(msft.columns)}")

Data shape: (1257, 5)
Date range: 2020-01-02 to 2024-12-30

Columns: [('Close', 'MSFT'), ('High', 'MSFT'), ('Low', 'MSFT'), ('Open', 'MSFT'), ('Volume', 'MSFT')]


The key parameters are:

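- `start` and `end`: Define the date range (format: 'YYYY-MM-DD'). The `end` date is exclusive, which is why the last row above is 2024-12-30 even though December 31, 2024 was a trading day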
- `progress`: Set to `False` to suppress the download progress bar, which keeps output cleaner

Notice that by default, yfinance returns only five columns: `Close`, `High`, `Low`, `Open`, and `Volume`. There is no `Adj Close` column. This is because the default behavior applies adjustments automatically.

In [6]:
# Look at the data
msft = flatten_columns(msft)
msft.tail()

Price        Close    High     Low    Open    Volume
Date
2024-12-23  432.06  434.45  429.66  433.54  19152500
2024-12-24  436.11  436.38  431.01  431.47   7164500
2024-12-26  434.90  437.71  433.43  435.86   8194200
2024-12-27  427.38  432.03  423.23  431.42  18117700
2024-12-30  421.72  424.42  418.81  422.94  13158700


### 2.2 The `auto_adjust` Parameter

The `auto_adjust` parameter controls whether yfinance returns adjusted or unadjusted prices. The default is `True`, which means:

- Prices are adjusted for splits and dividends
- The `Adj Close` column is not shown (because all prices are already adjusted)

For this book, we recommend setting `auto_adjust=False` so you can see both `Close` and `Adj Close` columns. This helps you understand what adjustments are being applied.

In [7]:
# Download with auto_adjust=False to see both Close and Adj Close
msft_unadj = yf.download('MSFT', start='2020-01-01', end='2024-12-31',
                          auto_adjust=False, progress=False)

print(f"Columns with auto_adjust=False: {list(msft_unadj.columns)}")

Columns with auto_adjust=False: [('Adj Close', 'MSFT'), ('Close', 'MSFT'), ('High', 'MSFT'), ('Low', 'MSFT'), ('Open', 'MSFT'), ('Volume', 'MSFT')]


In [8]:
msft_unadj = flatten_columns(msft_unadj)
msft_unadj[['Close', 'Adj Close']].tail()

Price        Close  Adj Close
Date
2024-12-23  435.25     432.06
2024-12-24  439.33     436.11
2024-12-26  438.11     434.90
2024-12-27  430.53     427.38
2024-12-30  424.83     421.72


Now we can see both columns. For recent data they are close: the rows above differ by less than 1%. The difference grows as you look further back in time due to cumulative dividend adjustments.

**Important clarification:** The `auto_adjust` parameter only controls whether price columns are adjusted and whether `Adj Close` appears. It does not affect volume adjustments or split adjustments. As we saw in Notebook 3.1, yfinance always returns split-adjusted data regardless of this setting.
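To see the size of the adjustment, divide `Adj Close` by `Close`. The sketch below uses the five rows printed above, typed in by hand rather than downloaded, so it runs without a network connection. The column name `adj_factor` is our own.

```python
import pandas as pd

# Values copied from the MSFT output above (auto_adjust=False)
prices = pd.DataFrame(
    {
        "Close": [435.25, 439.33, 438.11, 430.53, 424.83],
        "Adj Close": [432.06, 436.11, 434.90, 427.38, 421.72],
    },
    index=pd.to_datetime(
        ["2024-12-23", "2024-12-24", "2024-12-26", "2024-12-27", "2024-12-30"]
    ),
)

# Ratio of adjusted to unadjusted close for each day
prices["adj_factor"] = prices["Adj Close"] / prices["Close"]
print(prices["adj_factor"].round(4))
```

The factor is essentially the same on every row (about 0.993), which is what you would expect over a short window with no ex-dividend date inside it.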

### 2.3 Handling MultiIndex Columns

When you download data for a single ticker, yfinance still returns MultiIndex columns with the ticker name as the second level. This is a recent change that can cause confusion.

In [9]:
# Download fresh data to show the MultiIndex
googl = yf.download('GOOGL', start='2024-01-01', end='2024-12-31',
                     auto_adjust=False, progress=False)

print(f"Column type: {type(googl.columns)}")
print(f"\nRaw columns: {list(googl.columns)}")

Column type: <class 'pandas.core.indexes.multi.MultiIndex'>

Raw columns: [('Adj Close', 'GOOGL'), ('Close', 'GOOGL'), ('High', 'GOOGL'), ('Low', 'GOOGL'), ('Open', 'GOOGL'), ('Volume', 'GOOGL')]


The columns are tuples like `('Close', 'GOOGL')`. This can cause errors when you try to access a column directly:

In [10]:
# This might cause issues depending on how you access the column
# googl['Close']  # Returns a DataFrame, not a Series

# The flatten_columns function fixes this
googl = flatten_columns(googl)

print(f"Columns after flattening: {list(googl.columns)}")
print(f"\nNow googl['Close'] returns a Series:")
print(f"Type: {type(googl['Close'])}")

Columns after flattening: ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume']

Now googl['Close'] returns a Series:
Type: <class 'pandas.core.series.Series'>


**Recommendation:** Apply `flatten_columns()` immediately after every `yf.download()` call when downloading a single ticker. This prevents subtle bugs later in your code.
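One caveat worth guarding against: on a multi-ticker download, dropping the ticker level leaves duplicate column names. The sketch below, run on a small synthetic frame instead of a live download, shows a stricter variant that refuses to flatten in that case. The name `flatten_single_ticker` is hypothetical, and returning a copy instead of modifying the frame in place is our design choice.

```python
import pandas as pd


def flatten_single_ticker(df):
    """Return a copy with single-level columns; refuse multi-ticker frames."""
    if not isinstance(df.columns, pd.MultiIndex):
        return df.copy()
    tickers = df.columns.get_level_values(1).unique()
    if len(tickers) > 1:
        raise ValueError(f"Expected one ticker, found {list(tickers)}")
    out = df.copy()
    out.columns = out.columns.get_level_values(0)
    return out


# Synthetic frame shaped like a single-ticker yfinance download (made-up values)
cols = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["GOOGL"]], names=["Price", "Ticker"]
)
single = pd.DataFrame([[100.0, 10], [101.0, 12]], columns=cols)

flat = flatten_single_ticker(single)
print(list(flat.columns))
```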

### 2.4 Downloading Multiple Tickers

You can download multiple tickers in a single call by passing a list. This is more efficient than separate downloads.

In [11]:
# Download multiple tickers
tickers = ['MSFT', 'GOOGL', 'TLT', 'GLD']
multi = yf.download(tickers, start='2024-01-01', end='2024-12-31',
                     auto_adjust=False, progress=False)

print(f"Data shape: {multi.shape}")
print(f"\nColumn structure:")
print(multi.columns[:8])  # Show first 8 columns

Data shape: (251, 24)

Column structure:
MultiIndex([('Adj Close',   'GLD'),
            ('Adj Close', 'GOOGL'),
            ('Adj Close',  'MSFT'),
            ('Adj Close',   'TLT'),
            (    'Close',   'GLD'),
            (    'Close', 'GOOGL'),
            (    'Close',  'MSFT'),
            (    'Close',   'TLT')],
           names=['Price', 'Ticker'])


With multiple tickers, the MultiIndex structure makes more sense. The first level is the price type (`Close`, `High`, etc.) and the second level is the ticker. You can select one price type across all tickers, or every price type for a single ticker:

In [12]:
# Access Adj Close for all tickers
adj_close_all = multi['Adj Close']
adj_close_all.tail()

Ticker         GLD   GOOGL    MSFT     TLT
Date
2024-12-23  240.96  193.87  432.06   83.77
2024-12-24  241.44  195.34  436.11   84.13
2024-12-26  243.07  194.84  434.90   84.08
2024-12-27  241.40  192.01  427.38   83.39
2024-12-30  240.63  190.49  421.72   84.06


In [13]:
# Access all price data for a single ticker
msft_only = multi.xs('MSFT', level='Ticker', axis=1)
msft_only.tail()

Price       Adj Close   Close    High     Low    Open    Volume
Date
2024-12-23     432.06  435.25  437.65  432.83  436.74  19152500
2024-12-24     436.11  439.33  439.60  434.19  434.65   7164500
2024-12-26     434.90  438.11  440.94  436.63  439.08   8194200
2024-12-27     427.38  430.53  435.22  426.35  434.60  18117700
2024-12-30     421.72  424.83  427.55  421.90  426.06  13158700


**When to download separately vs. together:**

- **Together:** When you need the same date range for all tickers and plan to analyze them as a group (e.g., portfolio analysis, correlation studies)
- **Separately:** When tickers have different date ranges, when you need different parameters for each, or when you want simpler single-level column names

**Working with Complete OHLC Data for Multiple Tickers**

Often you need all price columns (Open, High, Low, Close, Adj Close, Volume) for each ticker, not just the closing prices. There are several ways to access this data from a multi-ticker download.

In [14]:
# Reminder of what we downloaded
print(f"Column levels: {multi.columns.names}")
print(f"\nPrice types available: {multi.columns.get_level_values(0).unique().tolist()}")
print(f"Tickers available: {multi.columns.get_level_values(1).unique().tolist()}")

Column levels: ['Price', 'Ticker']

Price types available: ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume']
Tickers available: ['GLD', 'GOOGL', 'MSFT', 'TLT']


The DataFrame has two levels: Price (first level) and Ticker (second level). We can slice this in different ways depending on what we need.

The `.xs()` method (short for "cross-section") selects data at a particular level of a MultiIndex. You specify the value you want to select (`'MSFT'`), which level to look in (`level='Ticker'`), and which axis (`axis=1` for columns, `axis=0` for rows). It returns a DataFrame with that level removed, giving you simple single-level column names.

This is an incredibly efficient method to extract information from a MultiIndex DataFrame!

In [15]:
# Method 1: Get all columns for a single ticker using .xs()
msft_all = multi.xs('MSFT', level='Ticker', axis=1)
print("All OHLCV data for MSFT:")
msft_all.tail()

All OHLCV data for MSFT:


Price       Adj Close   Close    High     Low    Open    Volume
Date
2024-12-23     432.06  435.25  437.65  432.83  436.74  19152500
2024-12-24     436.11  439.33  439.60  434.19  434.65   7164500
2024-12-26     434.90  438.11  440.94  436.63  439.08   8194200
2024-12-27     427.38  430.53  435.22  426.35  434.60  18117700
2024-12-30     421.72  424.83  427.55  421.90  426.06  13158700


The `.swaplevel()` method reverses the order of levels in a MultiIndex. Our original structure is Price → Ticker (so `multi['Close']['MSFT']` gives you MSFT's close prices). After swapping, the structure becomes Ticker → Price (so `multi_swapped['MSFT']['Close']` gives the same result). This makes it easier to grab all columns for a single ticker with simple bracket notation. The `axis=1` specifies we are swapping column levels, and `.sort_index(axis=1)` keeps the columns in alphabetical order.

In [16]:
# Method 2: Swap levels and access by ticker first
multi_swapped = multi.swaplevel(axis=1).sort_index(axis=1)
print(f"New column structure: {multi_swapped.columns[:8].tolist()}")

New column structure: [('GLD', 'Adj Close'), ('GLD', 'Close'), ('GLD', 'High'), ('GLD', 'Low'), ('GLD', 'Open'), ('GLD', 'Volume'), ('GOOGL', 'Adj Close'), ('GOOGL', 'Close')]


In [17]:
# Now we can access all data for a ticker with simple indexing
googl_all = multi_swapped['GOOGL']
print("\nAll OHLCV data for GOOGL:")
googl_all.tail()


All OHLCV data for GOOGL:


Price       Adj Close   Close    High     Low    Open    Volume
Date
2024-12-23     193.87  194.63  195.10  190.15  192.62  25675000
2024-12-24     195.34  196.11  196.11  193.78  194.84  10403300
2024-12-26     194.84  195.60  196.75  194.38  195.15  12046600
2024-12-27     192.01  192.76  195.32  190.65  194.95  18891400
2024-12-30     190.49  191.24  192.55  189.12  189.80  14264700
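If you want to convince yourself that the two methods select the same data, a tiny synthetic frame is enough. The tickers `AAA` and `BBB` and all values below are made up; only the column layout mirrors what yfinance returns.

```python
import pandas as pd

cols = pd.MultiIndex.from_product(
    [["Close", "Volume"], ["AAA", "BBB"]], names=["Price", "Ticker"]
)
demo = pd.DataFrame(
    [[10.0, 20.0, 100, 200], [11.0, 21.0, 110, 210]],
    index=pd.to_datetime(["2024-01-02", "2024-01-03"]),
    columns=cols,
)

# Method 1: cross-section on the Ticker level
via_xs = demo.xs("AAA", level="Ticker", axis=1)

# Method 2: swap levels, sort, then index by ticker
via_swap = demo.swaplevel(axis=1).sort_index(axis=1)["AAA"]

print(list(via_xs.columns), list(via_swap.columns))
print(via_xs.equals(via_swap))
```

Which one to use is mostly a matter of taste: `.xs()` is a one-off selection, while swapping levels once pays off if you will index by ticker repeatedly.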


**Creating Separate DataFrames for Each Ticker**

If you prefer to work with individual DataFrames for each ticker (each with simple column names), you can split the multi-ticker download into a dictionary of DataFrames.

In [18]:
# Split into dictionary of DataFrames, one per ticker
ticker_dfs = {}
for ticker in tickers:
    ticker_dfs[ticker] = multi.xs(ticker, level='Ticker', axis=1)

# Now you have clean, separate DataFrames
print(f"Available tickers: {list(ticker_dfs.keys())}")
print(f"\nTLT columns: {list(ticker_dfs['TLT'].columns)}")
ticker_dfs['TLT'].tail()

Available tickers: ['MSFT', 'GOOGL', 'TLT', 'GLD']

TLT columns: ['Adj Close', 'Close', 'High', 'Low', 'Open', 'Volume']


Price       Adj Close   Close    High     Low    Open    Volume
Date
2024-12-23      83.77   87.50   88.23   87.44   88.16  32764600
2024-12-24      84.13   87.87   87.89   86.98   87.04  22377600
2024-12-26      84.08   87.82   87.96   87.20   87.21  19981800
2024-12-27      83.39   87.10   87.78   87.06   87.48  27262300
2024-12-30      84.06   87.80   88.04   87.67   87.83  48519600


This dictionary approach gives you the best of both worlds: a single efficient download, but simple single-level column names when you work with individual tickers. You can access any ticker with `ticker_dfs['MSFT']` and work with it just like a single-ticker download.

In [19]:
# Example: Calculate returns for each ticker
for ticker, df in ticker_dfs.items():
    total_return = (df['Adj Close'].iloc[-1] / df['Adj Close'].iloc[0] - 1) * 100
    print(f"{ticker}: {total_return:.2f}% total return")

MSFT: 15.41% total return
GOOGL: 38.91% total return
TLT: -7.02% total return
GLD: 26.17% total return
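The total return above uses only the first and last adjusted close. Compounding the daily returns over the same window gives the same number, which is a useful sanity check once you start working with return series. A minimal sketch on made-up prices:

```python
import pandas as pd

# Hypothetical adjusted closes, not real market data
adj_close = pd.Series([100.0, 102.0, 99.96, 104.958])

# Endpoint method, as in the loop above
total_return = adj_close.iloc[-1] / adj_close.iloc[0] - 1

# Compounding method: chain the daily simple returns
daily = adj_close.pct_change().dropna()
compounded = (1 + daily).prod() - 1

print(f"Endpoint: {total_return:.4%}  Compounded: {compounded:.4%}")
```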


### 2.5 The Ticker Object

For detailed information about a specific security, use `yf.Ticker()`. This gives you access to much more than just price data.

In [20]:
# Create a Ticker object
msft_ticker = yf.Ticker('MSFT')

In [21]:
# Get company information
info = msft_ticker.info

# Display selected fields
fields = ['shortName', 'sector', 'industry', 'marketCap', 'trailingPE', 'dividendYield']
for field in fields:
    value = info.get(field, 'N/A')
    if field == 'marketCap' and value != 'N/A':
        value = f"${value/1e9:.1f}B"
    if field == 'dividendYield' and value != 'N/A':
        # In this yfinance version, dividendYield is already expressed in percent
        value = f"{value:.2f}%"
    print(f"{field}: {value}")

shortName: Microsoft Corporation
sector: Technology
industry: Software - Infrastructure
marketCap: $3291.6B
trailingPE: 31.540165
dividendYield: 0.80%


The `.info` dictionary contains dozens of fields. The available fields can vary by security type and may change as yfinance is updated. For a complete list of Ticker attributes and methods, see the yfinance documentation at https://github.com/ranaroussi/yfinance

You can also explore what is available interactively:

In [22]:
# See all available attributes and methods on a Ticker object
ticker_attributes = [attr for attr in dir(msft_ticker) if not attr.startswith('_')]
print(f"Available attributes/methods ({len(ticker_attributes)} total):")
print(ticker_attributes)

Available attributes/methods (99 total):
['actions', 'analyst_price_targets', 'balance_sheet', 'balancesheet', 'calendar', 'capital_gains', 'cash_flow', 'cashflow', 'dividends', 'earnings', 'earnings_dates', 'earnings_estimate', 'earnings_history', 'eps_revisions', 'eps_trend', 'fast_info', 'financials', 'funds_data', 'get_actions', 'get_analyst_price_targets', 'get_balance_sheet', 'get_balancesheet', 'get_calendar', 'get_capital_gains', 'get_cash_flow', 'get_cashflow', 'get_dividends', 'get_earnings', 'get_earnings_dates', 'get_earnings_estimate', 'get_earnings_history', 'get_eps_revisions', 'get_eps_trend', 'get_fast_info', 'get_financials', 'get_funds_data', 'get_growth_estimates', 'get_history_metadata', 'get_income_stmt', 'get_incomestmt', 'get_info', 'get_insider_purchases', 'get_insider_roster_holders', 'get_insider_transactions', 'get_institutional_holders', 'get_isin', 'get_major_holders', 'get_mutualfund_holders', 'get_news', 'get_recommendations', 'get_recommendations_summar

Common useful attributes include `.info`, `.history()`, `.dividends`, `.splits`, `.financials`, `.balance_sheet`, `.cashflow`, `.recommendations`, and `.calendar`.

Not all attributes return data for every ticker, and some may require a premium Yahoo Finance subscription.

Let's look at `.dividends` and `.splits` as examples.

In [23]:
# Get dividend history
dividends = msft_ticker.dividends
print(f"Number of dividend payments: {len(dividends)}")
print(f"\nRecent dividends:")
print(dividends.tail(8).to_string())

Number of dividend payments: 88

Recent dividends:
Date
2024-02-14 00:00:00-05:00   0.75
2024-05-15 00:00:00-04:00   0.75
2024-08-15 00:00:00-04:00   0.75
2024-11-21 00:00:00-05:00   0.83
2025-02-20 00:00:00-05:00   0.83
2025-05-15 00:00:00-04:00   0.83
2025-08-21 00:00:00-04:00   0.83
2025-11-20 00:00:00-05:00   0.91


In [24]:
# Get split history
splits = msft_ticker.splits
print(f"Stock splits:")
print(splits[splits > 0].to_string())

Stock splits:
Date
1987-09-21 00:00:00-04:00   2.00
1990-04-16 00:00:00-04:00   2.00
1991-06-27 00:00:00-04:00   1.50
1992-06-15 00:00:00-04:00   1.50
1994-05-23 00:00:00-04:00   2.00
1996-12-09 00:00:00-05:00   2.00
1998-02-23 00:00:00-05:00   2.00
1999-03-29 00:00:00-05:00   2.00
2003-02-18 00:00:00-05:00   2.00


The Ticker object also has a `.history()` method that works similarly to `yf.download()`. The main difference is that `.history()` is designed for single tickers and offers some additional parameters. For most use cases, `yf.download()` is sufficient and more commonly used.

### 2.6 Date Ranges and Periods

You have two ways to specify the time range for your download.

**Option 1: Explicit dates with `start` and `end`**

This is the most precise approach and what we recommend for reproducible research.

In [25]:
# Explicit date range
df = yf.download('TLT', start='2023-06-01', end='2023-12-31',
                  auto_adjust=False, progress=False)
df = flatten_columns(df)

print(f"Date range: {df.index[0].strftime('%Y-%m-%d')} to {df.index[-1].strftime('%Y-%m-%d')}")
print(f"Trading days: {len(df)}")

Date range: 2023-06-01 to 2023-12-29
Trading days: 147
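yfinance treats `end` as exclusive, so a window meant to include its final date needs the following calendar day passed as `end`. That makes no difference above because 2023-12-31 fell on a weekend, but it matters whenever the end date is a trading day. The helper below is a small sketch of that adjustment; the name `inclusive_end` is hypothetical.

```python
from datetime import date, timedelta


def inclusive_end(end_date):
    """Return the day after `end_date` ('YYYY-MM-DD') for use as an exclusive end."""
    return (date.fromisoformat(end_date) + timedelta(days=1)).isoformat()


print(inclusive_end("2024-12-31"))

# Possible usage (requires network access):
# yf.download('TLT', start='2024-06-01', end=inclusive_end('2024-12-31'),
#             auto_adjust=False, progress=False)
```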


**Option 2: Relative periods with `period`**

This is convenient for quick analysis but less reproducible since the date range changes each time you run the code.

In [26]:
# Relative period - last 6 months
df = yf.download('TLT', period='6mo', auto_adjust=False, progress=False)
df = flatten_columns(df)

print(f"Date range: {df.index[0].strftime('%Y-%m-%d')} to {df.index[-1].strftime('%Y-%m-%d')}")
print(f"Trading days: {len(df)}")

Date range: 2025-07-21 to 2026-01-21
Trading days: 128


Valid period values include: `1d`, `5d`, `1mo`, `3mo`, `6mo`, `1y`, `2y`, `5y`, `10y`, `ytd`, `max`

**Intraday Data with `interval`**

You can download intraday data by specifying the `interval` parameter. However, there are significant limitations.

In [27]:
# Download hourly data for the past 5 days
df_hourly = yf.download('MSFT', period='5d', interval='1h',
                         auto_adjust=False, progress=False)
df_hourly = flatten_columns(df_hourly)

print(f"Hourly data shape: {df_hourly.shape}")
print(f"\nFirst few rows:")
df_hourly.head()

Hourly data shape: (35, 6)

First few rows:


Price                      Adj Close   Close    High     Low    Open   Volume
Datetime
2026-01-14 14:30:00+00:00     464.69  464.69  468.17  463.20  467.33  3705306
2026-01-14 15:30:00+00:00     463.90  463.90  465.11  461.19  464.68  2123491
2026-01-14 16:30:00+00:00     461.62  461.62  464.29  460.73  463.93  1699793
2026-01-14 17:30:00+00:00     460.58  460.58  462.52  460.25  461.59  1631373
2026-01-14 18:30:00+00:00     458.83  458.83  460.74  458.10  460.59  1981300


**Intraday data limitations:**

- `1m` (1-minute): Only available for the last 7 days
- `5m`, `15m`, `30m`: Only available for the last 60 days
- `1h` (hourly): Only available for the last 730 days

For longer historical intraday data, you would need a paid data provider. For most trading strategies developed in this book, daily data is sufficient.
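One way to avoid empty or failed intraday requests is to check the start date against these limits before calling `yf.download()`. The sketch below encodes the limits listed above in a dictionary; the names `INTRADAY_LIMITS`, `earliest_available`, and `is_within_limit` are our own, and the limits themselves may change on Yahoo's side.

```python
from datetime import date, timedelta

# Lookback limits in days, taken from the list above
INTRADAY_LIMITS = {"1m": 7, "5m": 60, "15m": 60, "30m": 60, "1h": 730}


def earliest_available(interval, today=None):
    """Earliest start date the limits allow, or None if the interval is not listed."""
    today = today or date.today()
    limit = INTRADAY_LIMITS.get(interval)
    if limit is None:
        return None
    return today - timedelta(days=limit)


def is_within_limit(interval, start, today=None):
    """True if `start` is recent enough for `interval` (unlisted intervals pass)."""
    earliest = earliest_available(interval, today)
    return earliest is None or start >= earliest


print(earliest_available("5m", today=date(2026, 1, 21)))
```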

---

## Section 3: Financial Modeling Prep (FMP)

Financial Modeling Prep is a commercial data provider that offers historical prices, fundamental data, and economic indicators via API.

### 3.1 Overview

FMP offers a wide range of financial data:

- Historical daily prices (similar to yfinance)
- Company fundamentals (income statements, balance sheets, cash flow)
- Financial ratios and metrics
- Economic indicators
- Earnings calendars and transcripts

**The reality of free tiers:** FMP used to offer a generous free tier, but like most data providers, they have significantly restricted it. The free tier is now extremely limited and not practical for real work. If you want to use FMP, plan on a paid subscription.

**When paid data makes sense:**

- You are trading with real money and need reliability
- You need fundamental data (financials, ratios) that yfinance does not provide reliably
- You need guaranteed uptime and support
- You need point-in-time data to avoid lookahead bias (available on higher tiers)

**To get an API key:** Visit https://financialmodelingprep.com and create an account.

### 3.2 Storing Your FMP Key

If you have an FMP API key, store it using Colab Secrets as described in Section 1. Name it `FMP_API_KEY`.

In [28]:
# Retrieve your FMP API key from Colab Secrets
# Uncomment these lines if you have set up the secret

# from google.colab import userdata
# FMP_API_KEY = userdata.get('FMP_API_KEY')

# For demonstration, we'll use a placeholder
# Replace with your actual key retrieval method
FMP_API_KEY = None  # Set to your key or retrieve from secrets

### 3.3 Basic API Calls

FMP uses a REST API. You construct a URL with your query parameters and API key, make an HTTP request, and parse the JSON response.

In [29]:
def get_fmp_data(endpoint, params=None):
    """
    Fetch data from Financial Modeling Prep API (stable version).

    Parameters:
    -----------
    endpoint : str
        API endpoint (e.g., 'historical-price-eod/full')
    params : dict, optional
        Query parameters including symbol

    Returns:
    --------
    dict or list : JSON response from API
    """
    if FMP_API_KEY is None:
        print("FMP_API_KEY not set. Please configure your API key.")
        return None

    base_url = "https://financialmodelingprep.com/stable"
    url = f"{base_url}/{endpoint}"

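    # Copy so the caller's dict is not modified, then add the API key
    params = dict(params or {})
    params['apikey'] = FMP_API_KEY

    # A timeout keeps a stalled connection from hanging the notebook
    response = requests.get(url, params=params, timeout=30)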

    if response.status_code == 200:
        return response.json()
    else:
        print(f"Error: {response.status_code}")
        print(response.text)
        return None

This function handles the boilerplate of making API requests. You provide the endpoint and any additional parameters, and it returns the parsed JSON response.
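To make the request concrete, here is roughly the URL that the function ends up calling, assembled by hand with the standard library so it runs without a key or a network connection. `YOUR_KEY` is a placeholder, and the exact encoding produced by `requests` may differ in small details.

```python
from urllib.parse import urlencode

base_url = "https://financialmodelingprep.com/stable"
endpoint = "historical-price-eod/full"

# Query parameters, with the API key added last as the function does
params = {
    "symbol": "MSFT",
    "from": "2024-01-01",
    "to": "2024-12-31",
    "apikey": "YOUR_KEY",  # placeholder, not a real key
}

url = f"{base_url}/{endpoint}?{urlencode(params)}"
print(url)
```

Seeing the full URL also makes the security point from Section 1 tangible: the key travels in the query string, so avoid printing or logging request URLs.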

### 3.4 Downloading Historical Prices

The historical price endpoint returns daily OHLCV data similar to yfinance.

In [30]:
def get_fmp_prices(ticker, start_date=None, end_date=None):
    """
    Download historical prices from FMP (stable API).

    Parameters:
    -----------
    ticker : str
        Stock ticker symbol
    start_date : str, optional
        Start date (YYYY-MM-DD)
    end_date : str, optional
        End date (YYYY-MM-DD)

    Returns:
    --------
    DataFrame with OHLCV data
    """
    params = {'symbol': ticker}
    if start_date:
        params['from'] = start_date
    if end_date:
        params['to'] = end_date

    data = get_fmp_data('historical-price-eod/full', params)

    if data is None or len(data) == 0:
        return None

    df = pd.DataFrame(data)
    df['date'] = pd.to_datetime(df['date'])
    df = df.set_index('date').sort_index()

    return df

# Example usage (only works with valid API key)
# df_fmp = get_fmp_prices('MSFT', '2024-01-01', '2024-12-31')
# df_fmp.head()

FMP returns columns including `open`, `high`, `low`, `close`, `adjClose`, `volume`, `change`, and `changePercent`. The data is generally reliable and includes adjusted close prices.
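If you switch between sources, it can help to rename FMP's lowercase columns to the yfinance-style names used elsewhere in this notebook. The sketch below runs on a hand-written payload shaped like the response described above; the values are made up, and the real response may contain additional fields.

```python
import pandas as pd

# Hypothetical payload; FMP returns a list of dicts, newest first in this example
payload = [
    {"date": "2024-01-03", "open": 10.0, "high": 10.5, "low": 9.8,
     "close": 10.2, "volume": 1200},
    {"date": "2024-01-02", "open": 9.9, "high": 10.1, "low": 9.7,
     "close": 10.0, "volume": 1000},
]

df = pd.DataFrame(payload)
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date").sort_index()

# Rename to yfinance-style names so downstream code can treat both sources alike
df = df.rename(columns={"open": "Open", "high": "High", "low": "Low",
                        "close": "Close", "volume": "Volume"})
print(df)
```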

**Advantages of FMP over yfinance:**

- More reliable uptime (it is a paid service with SLAs)
- Customer support
- Consistent data structure that does not change unexpectedly
- Access to fundamental data in the same API
- Point-in-time data available on higher tiers

### 3.5 Accessing Fundamental Data

One of FMP's main advantages is access to company financial statements. This is useful for fundamental analysis and factor-based strategies.

In [31]:
def get_fmp_income_statement(ticker, period='annual', limit=5):
    """
    Download income statement data from FMP (stable API).

    Parameters:
    -----------
    ticker : str
        Stock ticker symbol
    period : str
        'annual' or 'quarter'
    limit : int
        Number of periods to retrieve

    Returns:
    --------
    DataFrame with income statement data
    """
    params = {'symbol': ticker, 'period': period, 'limit': limit}
    data = get_fmp_data('income-statement', params)

    if data is None:
        return None

    return pd.DataFrame(data)

# Example usage (only works with valid API key)
# income = get_fmp_income_statement('MSFT')
# income[['date', 'revenue', 'netIncome', 'eps']].head()

Similar functions can be built for balance sheets and cash flow statements by swapping the endpoint name (`balance-sheet-statement` and `cash-flow-statement`) and passing the ticker through the same `symbol` parameter. We will not use fundamental data in our case study, but it is valuable for factor-based strategies covered in later chapters.

### 3.6 When to Pay for Data

Here is a practical framework for deciding when free data (yfinance) is sufficient and when you should invest in paid data:

**Use free data (yfinance) when:**

- Learning and experimentation
- Personal research projects
- Backtesting ideas before committing resources
- Academic or educational purposes

**Consider paid data when:**

- Trading with real money
- You need guaranteed uptime and reliability
- You need fundamental data (financials, ratios)
- You need point-in-time data to properly avoid lookahead bias
- You need customer support when things break
- Your time is valuable and debugging data issues costs more than a subscription

The old saying applies: you get what you pay for. Free data is fine for learning. When real money is on the line, the cost of a data subscription is trivial compared to the cost of bad data leading to bad decisions.

---

## Section 4: FRED via pandas-datareader

The Federal Reserve Economic Data (FRED) database is a treasure trove of economic indicators. It is free, reliable, and easy to access via `pandas-datareader`.

### 4.1 What is FRED?

FRED is maintained by the Federal Reserve Bank of St. Louis and contains over 800,000 economic time series. Common series include:

- **GDP**: Gross Domestic Product
- **UNRATE**: Unemployment Rate
- **CPIAUCSL**: Consumer Price Index (inflation)
- **DFF**: Federal Funds Rate
- **T10Y2Y**: 10-Year minus 2-Year Treasury spread (yield curve)
- **VIXCLS**: VIX (alternative to Yahoo)

You can browse all available series at https://fred.stlouisfed.org

### 4.2 Basic Usage

The `pandas-datareader` package provides a simple interface to FRED.

In [32]:
# Download the Federal Funds Rate
fed_funds = pdr.DataReader(
    name='DFF',
    data_source='fred',
    start='2020-01-01',
    end='2024-12-31'
)

print(f"Data shape: {fed_funds.shape}")
print(f"Date range: {fed_funds.index[0].strftime('%Y-%m-%d')} to {fed_funds.index[-1].strftime('%Y-%m-%d')}")
fed_funds.tail(10)

Data shape: (1827, 1)
Date range: 2020-01-01 to 2024-12-31


             DFF
DATE
2024-12-22  4.33
2024-12-23  4.33
2024-12-24  4.33
2024-12-25  4.33
2024-12-26  4.33
2024-12-27  4.33
2024-12-28  4.33
2024-12-29  4.33
2024-12-30  4.33
2024-12-31  4.33


The syntax is straightforward: specify the series code, the source (`'fred'`), and the date range. The result is a DataFrame indexed by date.

In [33]:
# Download multiple series at once
series = ['UNRATE', 'CPIAUCSL', 'T10Y2Y']
econ_data = pdr.DataReader(
    name=series,
    data_source='fred',
    start='2020-01-01',
    end='2024-12-31'
)

print(f"Columns: {list(econ_data.columns)}")
print(f"Total rows: {len(econ_data)}")
print(f"\nSample of data (note the NaN pattern):")
econ_data.loc['2024-01-01':'2024-01-15']

Columns: ['UNRATE', 'CPIAUCSL', 'T10Y2Y']
Total rows: 1321

Sample of data (note the NaN pattern):


            UNRATE  CPIAUCSL  T10Y2Y
DATE
2024-01-01    3.70    309.79     NaN
2024-01-02     NaN       NaN   -0.38
2024-01-03     NaN       NaN   -0.42
2024-01-04     NaN       NaN   -0.39
2024-01-05     NaN       NaN   -0.35
2024-01-08     NaN       NaN   -0.35
2024-01-09     NaN       NaN   -0.34
2024-01-10     NaN       NaN   -0.33
2024-01-11     NaN       NaN   -0.28
2024-01-12     NaN       NaN   -0.18


Notice that different series have different frequencies. `UNRATE` (unemployment) and `CPIAUCSL` are monthly, while `T10Y2Y` (yield curve spread) is daily. When you download multiple series, `pandas-datareader` returns them on a single index containing every date on which any series has an observation, with `NaN` wherever a given series has no value for that date. That is why 2024-01-01 appears above with monthly values but no spread.
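A common way to fill those gaps is to carry the last monthly value forward. The sketch below does this on a small synthetic frame with the same NaN pattern; the column names and values are made up. Note that forward-filling only fills gaps. It does nothing about when a value was actually published.

```python
import pandas as pd

idx = pd.to_datetime(
    ["2024-01-01", "2024-01-02", "2024-01-03", "2024-02-01", "2024-02-02"]
)
mixed = pd.DataFrame(
    {
        "monthly": [3.7, None, None, 3.9, None],      # hypothetical monthly series
        "daily": [None, -0.38, -0.42, -0.30, -0.29],  # hypothetical daily series
    },
    index=idx,
)

# Carry the last known monthly value forward; leave the daily series untouched
filled = mixed.copy()
filled["monthly"] = filled["monthly"].ffill()
print(filled)
```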

### 4.3 The Economic Data Trap

Economic data from FRED is free and high quality. However, using it in trading strategies is fraught with pitfalls. This is one of the most common mistakes I see from students and even professionals.

**Problem 1: Frequency Mismatch**

Most economic indicators are released monthly or quarterly. If you are building a daily trading strategy, you need to decide how to handle this mismatch. Forward-filling (using the last known value) is common but introduces its own issues.

In [34]:
# GDP is quarterly
gdp = pdr.DataReader('GDP', 'fred', start='2023-01-01', end='2024-12-31')
print(f"GDP observations: {len(gdp)}")
print(f"\nGDP data:")
print(gdp.to_string())

GDP observations: 8

GDP data:
                GDP
DATE               
2023-01-01 27216.44
2023-04-01 27530.06
2023-07-01 28074.85
2023-10-01 28424.72
2024-01-01 28708.16
2024-04-01 29147.04
2024-07-01 29511.66
2024-10-01 29825.18


You only get a handful of observations per year. Building a robust trading signal from 4-8 data points is statistically questionable.
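
Forward-filling, mentioned above, can be sketched as follows. The quarterly series here is synthetic (the name `INDICATOR` and its values are placeholders, not FRED data). The point is that the daily series has many rows but still carries only as many independent observations as the original.

```python
import pandas as pd

# Synthetic quarterly series (illustrative values only, not FRED data)
quarterly = pd.Series(
    [100.0, 101.5, 103.0],
    index=pd.to_datetime(['2024-01-01', '2024-04-01', '2024-07-01']),
    name='INDICATOR'
)

# Target index: the business days we would want a daily signal for
daily_index = pd.bdate_range('2024-01-01', '2024-07-31')

# Reindex to daily and carry the last known value forward
daily = quarterly.reindex(daily_index, method='ffill')

print(f"Daily rows: {len(daily)}")
print(f"Distinct values: {daily.nunique()}")
```

Note that this version fills from the date FRED assigns to the observation, not from the date the figure was published, so it still suffers from the publication-lag problem described next.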

**Problem 2: Publication Lag**

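Economic data is not available in real time. GDP for Q1 is not released until late April. Unemployment for January is not released until early February. FRED nevertheless stamps each observation with the start of its reference period, which is why January unemployment appears under 2024-01-01 in the download above. If your backtest uses that value at any point in January, you have introduced lookahead bias because the figure was not actually published until early February.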
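
A rough way to guard against this is to re-stamp each observation with the date it became available before forward-filling. The sketch below assumes a flat 40-day lag from the reference date, which is a deliberately simple placeholder; real release dates vary by series and by month, and the series itself is synthetic.

```python
import pandas as pd

# Synthetic monthly series stamped on the first of the reference month
# (illustrative values only, not FRED data)
monthly = pd.Series(
    [3.7, 3.9, 3.8],
    index=pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
    name='INDICATOR'
)

# ASSUMPTION: a flat 40-day publication lag. Replace this with the actual
# release calendar for the series wherever the timing matters.
ASSUMED_LAG = pd.Timedelta(days=40)

# Re-stamp each observation with the date it is assumed to become available
available = monthly.copy()
available.index = monthly.index + ASSUMED_LAG

# Forward-fill onto business days; dates before the first release stay NaN
daily_index = pd.bdate_range('2024-01-01', '2024-04-30')
signal = available.reindex(daily_index, method='ffill')

print(signal.loc[pd.Timestamp('2024-01-31')])  # NaN: nothing published yet
print(signal.loc[pd.Timestamp('2024-02-12')])  # 3.7: first figure now available
```

The leading `NaN` values are the honest answer: on those days a live strategy would not have had the number.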

**Problem 3: Revisions**

This is the most insidious problem. Economic data gets revised, sometimes significantly. The GDP number released in April for Q1 is an "advance estimate." It gets revised in May ("second estimate") and June ("third estimate"). Sometimes it gets revised years later.

When you download GDP from FRED today, you get the *current* values, which include all revisions. But when you are backtesting a strategy that would have traded in April 2023, you should only use the data that was available in April 2023, not the revised numbers.

This creates lookahead bias that is very difficult to detect. Your backtest looks great because it traded on revised data that was not actually known at the time.
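
To make the idea of "data as it was known at the time" concrete, here is a minimal sketch of a point-in-time lookup. It assumes you have a table of vintages, one row per reference period and release. The table, the helper `value_as_of`, the release dates, and all values are hypothetical; they are not actual GDP figures.

```python
import pandas as pd

# Hypothetical vintage table: one row per (reference period, release).
# Release dates and values are made up for illustration.
vintages = pd.DataFrame({
    'period': pd.to_datetime(['2023-01-01'] * 3),
    'release_date': pd.to_datetime(['2023-04-27', '2023-05-25', '2023-06-29']),
    'value': [100.0, 100.4, 100.9],
})

def value_as_of(vintages: pd.DataFrame, period: str, as_of: str):
    """Return the latest value for `period` published on or before `as_of`."""
    mask = (
        (vintages['period'] == pd.Timestamp(period))
        & (vintages['release_date'] <= pd.Timestamp(as_of))
    )
    known = vintages.loc[mask].sort_values('release_date')
    if known.empty:
        return None  # nothing had been published yet
    return known['value'].iloc[-1]

print(value_as_of(vintages, '2023-01-01', '2023-04-15'))  # None
print(value_as_of(vintages, '2023-01-01', '2023-05-01'))  # 100.0
print(value_as_of(vintages, '2023-01-01', '2023-12-31'))  # 100.9
```

A backtest that calls a lookup like this with the simulated trading date sees the first estimate in early May and the revised figure only after it was released.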

**When Economic Data Can Be Useful**

Despite these challenges, economic data is not useless. It can be valuable for:

- **Long-term regime indicators**: Identifying whether we are in expansion or recession (if you account for publication lag)
- **Context and analysis**: Understanding the macro environment even if you do not trade on it directly
- **Low-frequency strategies**: Monthly or quarterly rebalancing where the timing issues matter less

For our case study in this book, we use daily price data where these issues do not apply. The VIX and SPY prices we use are available in real-time and are not revised after the fact.

---

## Section 5: Data Storage

Once you download data, you should store it locally. This avoids repeated API calls, ensures reproducibility, and speeds up your workflow.

### 5.1 Why Store Data Locally?

- **Avoid repeated API calls**: Downloading the same data repeatedly is wasteful and can hit rate limits
- **Reproducibility**: Data providers can change historical data retroactively; your local copy preserves the data as it was when you downloaded it
- **Speed**: Reading from a local file is much faster than an API call
- **Offline work**: You can work without an internet connection
- **Version control**: You can track changes to your data over time

### 5.2 CSV Files

CSV (Comma-Separated Values) is the most universal format. Any program can read it, and you can open it in a text editor to inspect the contents.

In [35]:
# Download some data to save
gld = yf.download('GLD', start='2020-01-01', end='2024-12-31',
                   auto_adjust=False, progress=False)
gld = flatten_columns(gld)

print(f"Data shape: {gld.shape}")
gld.tail()

Data shape: (1257, 6)


Date,Adj Close,Close,High,Low,Open,Volume
2024-12-23,240.96,240.96,241.67,240.65,241.6,5835500
2024-12-24,241.44,241.44,241.66,240.82,241.49,2421000
2024-12-26,243.07,243.07,243.56,242.2,242.39,4645100
2024-12-27,241.4,241.4,241.95,241.05,241.2,4728100
2024-12-30,240.63,240.63,241.08,239.58,241.08,3522500


In [36]:
# Save to CSV
gld.to_csv('gld_prices.csv')
print("Saved to gld_prices.csv")

Saved to gld_prices.csv


**Important note for Google Colab users:** When you save a file in Colab, it is stored in the temporary runtime instance, not on your Google Drive. If your Colab session disconnects or you terminate the runtime, the file is lost. To keep the file, you have two options:

- **Download it directly**: Click the folder icon in the left sidebar, find the file, right-click, and select "Download."
- **Mount Google Drive**: You can mount your Google Drive to Colab and save files there for permanent storage. This is covered in many online tutorials but is beyond the scope of this notebook.

For quick experiments, downloading is sufficient. For ongoing projects where you want persistent storage, mounting Google Drive is worth learning.

In [37]:
# Check what was saved
!head -5 gld_prices.csv

Date,Adj Close,Close,High,Low,Open,Volume
2020-01-02,143.9499969482422,143.9499969482422,144.2100067138672,143.39999389648438,143.86000061035156,7733800
2020-01-03,145.86000061035156,145.86000061035156,146.32000732421875,145.39999389648438,145.75,12272800
2020-01-06,147.38999938964844,147.38999938964844,148.47999572753906,146.9499969482422,148.44000244140625,14403300
2020-01-07,147.97000122070312,147.97000122070312,148.13999938964844,147.42999267578125,147.57000732421875,7978500


The index (dates) is saved as the first column. When you read the file back, you need to tell pandas which column is the index and that it contains dates.

In [38]:
# Read from CSV
gld_loaded = pd.read_csv('gld_prices.csv', index_col=0, parse_dates=True)

print(f"Loaded shape: {gld_loaded.shape}")
print(f"Index type: {type(gld_loaded.index)}")
gld_loaded.tail()

Loaded shape: (1257, 6)
Index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>


Date,Adj Close,Close,High,Low,Open,Volume
2024-12-23,240.96,240.96,241.67,240.65,241.6,5835500
2024-12-24,241.44,241.44,241.66,240.82,241.49,2421000
2024-12-26,243.07,243.07,243.56,242.2,242.39,4645100
2024-12-27,241.4,241.4,241.95,241.05,241.2,4728100
2024-12-30,240.63,240.63,241.08,239.58,241.08,3522500


The `index_col=0` parameter tells pandas that the first column is the index. The `parse_dates=True` parameter converts it to datetime format.

**Pros of CSV:**
- Universal format, readable by any software
- Human-readable (you can open it in a text editor)
- Works well with version control (git can show diffs)

**Cons of CSV:**
- Larger file size than binary formats
- Slower to read/write for large datasets
- Can lose precision for floating point numbers
- Does not preserve data types (everything becomes strings on disk)
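
The data type issue is easy to see with a small experiment. The sketch below writes a synthetic frame to an in-memory CSV (so no file is created) and reads it back twice: once with default settings and once with explicit type instructions. The column names are placeholders chosen for the example.

```python
import io
import numpy as np
import pandas as pd

# Small synthetic frame with dtypes that plain text cannot describe
df = pd.DataFrame({
    'trade_date': pd.to_datetime(['2024-01-02', '2024-01-03']),
    'ticker': pd.Categorical(['GLD', 'GLD']),
    'volume': np.array([100, 200], dtype='int32'),
})

# Write to an in-memory CSV and read it straight back with default settings
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
roundtrip = pd.read_csv(buffer)

print("Before:")
print(df.dtypes)
print("\nAfter CSV round trip:")
print(roundtrip.dtypes)

# Restoring the original types takes explicit instructions on read
buffer.seek(0)
restored = pd.read_csv(
    buffer,
    parse_dates=['trade_date'],
    dtype={'ticker': 'category', 'volume': 'int32'},
)
print("\nAfter CSV round trip with explicit types:")
print(restored.dtypes)
```

With default settings the dates come back as plain text, the categorical becomes ordinary strings, and the 32-bit integers are widened to 64-bit. Nothing is lost in this tiny case, but you have to remember the schema yourself.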

### 5.3 Excel Files

Excel format is useful when you need to share data with non-programmers or want to do quick manual inspection in a spreadsheet.

In [39]:
# Save to Excel
gld.to_excel('gld_prices.xlsx')
print("Saved to gld_prices.xlsx")

Saved to gld_prices.xlsx


In [40]:
# Read from Excel
gld_excel = pd.read_excel('gld_prices.xlsx', index_col=0)

print(f"Loaded shape: {gld_excel.shape}")
gld_excel.tail()

Loaded shape: (1257, 6)


Date,Adj Close,Close,High,Low,Open,Volume
2024-12-23,240.96,240.96,241.67,240.65,241.6,5835500
2024-12-24,241.44,241.44,241.66,240.82,241.49,2421000
2024-12-26,243.07,243.07,243.56,242.2,242.39,4645100
2024-12-27,241.4,241.4,241.95,241.05,241.2,4728100
2024-12-30,240.63,240.63,241.08,239.58,241.08,3522500


**Pros of Excel:**
- Easy to share with non-technical colleagues
- Can include multiple sheets in one file
- Supports formatting (though pandas does not preserve it)

**Cons of Excel:**
- Slower than CSV for reading and writing
- Larger file sizes
- Row limits (1,048,576 rows per sheet)
- Can introduce formatting issues with dates

### 5.4 Pickle Files

Pickle is Python's native serialization format. It preserves the exact state of your DataFrame, including data types, index, and metadata.

In [41]:
# Save to pickle
gld.to_pickle('gld_prices.pkl')
print("Saved to gld_prices.pkl")

Saved to gld_prices.pkl


In [42]:
# Read from pickle
gld_pickle = pd.read_pickle('gld_prices.pkl')

print(f"Loaded shape: {gld_pickle.shape}")
print(f"Index type: {type(gld_pickle.index)}")
print(f"\nData types:")
print(gld_pickle.dtypes)

Loaded shape: (1257, 6)
Index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

Data types:
Price
Adj Close    float64
Close        float64
High         float64
Low          float64
Open         float64
Volume         int64
dtype: object


Notice that we did not need to specify `index_col` or `parse_dates`. Pickle preserves everything exactly as it was.

**Pros of Pickle:**
- Fast read/write
- Preserves all DataFrame metadata (dtypes, index, column names)
- Compact file size
- No parsing needed when loading

**Cons of Pickle:**
- Python-specific (cannot open in Excel or other tools)
- Not human-readable
- Version compatibility issues: A pickle file created with one version of pandas may fail to load in a different version. This is a significant problem for long-term storage or sharing code across environments.
- Security risk if loading pickles from untrusted sources

Because of the version compatibility issue, many data scientists have moved away from pickle for data storage. The preferred alternative is parquet.

### 5.5 Parquet Files

Parquet has become the preferred format for storing DataFrames in data science workflows. It solves the main problems with both CSV (slow, loses types) and pickle (version issues, Python-only).

In [43]:
# Save to parquet
gld.to_parquet('gld_prices.parquet')
print("Saved to gld_prices.parquet")

Saved to gld_prices.parquet


In [44]:
# Read from parquet
gld_parquet = pd.read_parquet('gld_prices.parquet')

print(f"Loaded shape: {gld_parquet.shape}")
print(f"Index type: {type(gld_parquet.index)}")
print(f"\nData types:")
print(gld_parquet.dtypes)

Loaded shape: (1257, 6)
Index type: <class 'pandas.core.indexes.datetimes.DatetimeIndex'>

Data types:
Price
Adj Close    float64
Close        float64
High         float64
Low          float64
Open         float64
Volume         int64
dtype: object


Like pickle, parquet preserves data types and requires no special parameters when loading. The index and column types are exactly as you saved them.

In [45]:
# Compare file sizes
import os

gld.to_csv('gld_prices.csv')
gld.to_pickle('gld_prices.pkl')
gld.to_parquet('gld_prices.parquet')

for fmt in ['csv', 'pkl', 'parquet']:
    size = os.path.getsize(f'gld_prices.{fmt}')
    print(f"{fmt:>8}: {size:,} bytes")

     csv: 138,403 bytes
     pkl: 71,527 bytes
 parquet: 64,822 bytes


Parquet files are compressed by default (pandas uses Snappy compression unless you specify otherwise), which typically results in smaller files than both CSV and pickle. On larger datasets the smaller files also mean less disk I/O, which tends to speed up read and write operations.

<table>
  <tr>
    <th>Feature</th>
    <th>Pickle</th>
    <th>Parquet</th>
  </tr>
  <tr>
    <td>Speed</td>
    <td>Very fast</td>
    <td>Fast</td>
  </tr>
  <tr>
    <td>File size</td>
    <td>Medium</td>
    <td>Small (compressed)</td>
  </tr>
  <tr>
    <td>Preserves dtypes</td>
    <td>Yes</td>
    <td>Yes</td>
  </tr>
  <tr>
    <td>Cross-platform</td>
    <td>No (Python only)</td>
    <td>Yes (Python, R, Spark, etc.)</td>
  </tr>
  <tr>
    <td>Version stability</td>
    <td>Poor (pandas version dependent)</td>
    <td>Excellent</td>
  </tr>
  <tr>
    <td>Human readable</td>
    <td>No</td>
    <td>No</td>
  </tr>
</table>

### 5.6 A Note on HDF5 and Very Large Datasets

For most trading strategies using daily data, parquet handles everything you need. However, if you work with tick data, high-frequency intraday data across many assets, or datasets that do not fit in memory, you may encounter HDF5 (`.h5` files).

HDF5 is designed for very large datasets. Its key advantage is that you can read and write slices of data without loading the entire file into memory. This matters when your dataset is tens or hundreds of gigabytes.

In [46]:
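# Example syntax (not executed; requires the PyTables package)
# format='table' is needed for `where` queries; the default 'fixed' format does not support them
# df.to_hdf('large_data.h5', key='prices', mode='w', format='table')
# df_subset = pd.read_hdf('large_data.h5', key='prices', where='index > "2024-01-01"')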

### 5.7 Choosing the Right Format


<table>
  <tr>
    <th>Format</th>
    <th>Best For</th>
    <th>Typical Size</th>
    <th>Avoid When</th>
  </tr>
  <tr>
    <td>CSV</td>
    <td>Sharing, version control, inspection</td>
    <td>&lt; 100 MB</td>
    <td>Need to preserve dtypes, large datasets</td>
  </tr>
  <tr>
    <td>Excel</td>
    <td>Sharing with non-programmers</td>
    <td>&lt; 50 MB</td>
    <td>Automated workflows, &gt; 1M rows</td>
  </tr>
  <tr>
    <td>Pickle</td>
    <td>Temporary caching within a session</td>
    <td>&lt; 500 MB</td>
    <td>Long-term storage, sharing across environments</td>
  </tr>
  <tr>
    <td>Parquet</td>
    <td>Most data science workflows</td>
    <td>&lt; 10 GB</td>
    <td>Need human-readable format</td>
  </tr>
  <tr>
    <td>HDF5</td>
    <td>Very large datasets, partial loading</td>
    <td>&gt; 10 GB</td>
    <td>Simple projects, small datasets</td>
  </tr>
</table>

### 5.8 A Note on Databases

For larger projects, you might consider a database instead of flat files:

- **SQLite**: File-based database, no server needed, good for single-user applications
- **PostgreSQL**: Full-featured database server, good for multi-user access and complex queries
- **TimescaleDB**: PostgreSQL extension optimized for time-series data

Databases make sense when:

- You have millions of rows of data
- Multiple users or processes need to access the same data
- You need complex queries (joins, aggregations, filtering)
- You need transactional integrity

For the strategies in this book, flat files are sufficient. Database design is beyond our scope, but it is worth knowing these options exist as your projects grow.
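
To give a feel for the simplest of these options, here is a minimal SQLite sketch using Python's built-in `sqlite3` module together with pandas. The table name `prices`, the synthetic data, and the in-memory database are choices made for the example; pass a filename (for instance a hypothetical `prices.db`) to `sqlite3.connect()` if you want the data to persist.

```python
import sqlite3
import numpy as np
import pandas as pd

# Synthetic price data (illustrative values only)
dates = pd.bdate_range('2024-01-02', periods=10, name='Date')
prices = pd.DataFrame({'Close': np.linspace(100.0, 109.0, 10)}, index=dates)

# An in-memory database keeps the example free of files on disk
conn = sqlite3.connect(':memory:')

# Write the DataFrame to a table; the index is stored as a 'Date' column
prices.to_sql('prices', conn, if_exists='replace')

# Let the database do the filtering, then rebuild the datetime index
query = "SELECT * FROM prices WHERE Date >= '2024-01-08' ORDER BY Date"
recent = pd.read_sql(query, conn, index_col='Date', parse_dates=['Date'])
conn.close()

print(f"Rows stored: {len(prices)}, rows returned: {len(recent)}")
recent.head()
```

The filtering happens inside the database, so only the matching rows are loaded into memory. That is the main practical difference from reading a whole flat file and slicing it in pandas.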

In [47]:
# Clean up temporary files
import os
for f in ['gld_prices.csv', 'gld_prices.xlsx', 'gld_prices.pkl', 'gld_prices.parquet']:
    if os.path.exists(f):
        os.remove(f)
print("Temporary files cleaned up.")

Temporary files cleaned up.


---

## Section 6: Putting It Together - A Download Workflow

Let's combine what we have learned into a reusable workflow for downloading and caching data.

### 6.1 A Reusable Download Function

This function downloads data from yfinance, flattens the columns, and saves it to a parquet file. It includes basic error handling and follows the function best practices from Chapter 2.

In [48]:
def download_and_cache(ticker: str,
                       start: str,
                       end: str,
                       cache_dir: str = 'data',
                       auto_adjust: bool = False,
                       refresh: bool = False) -> pd.DataFrame:
    """
    Download stock data from yfinance with local caching.

    Checks for existing cached file before downloading. Includes download
    date in filename for version tracking.

    Parameters:
    -----------
    ticker : str
        Stock ticker symbol (e.g., 'MSFT', 'SPY')
    start : str
        Start date in 'YYYY-MM-DD' format
    end : str
        End date in 'YYYY-MM-DD' format
    cache_dir : str, default 'data'
        Directory to store cached files
    auto_adjust : bool, default False
        Whether to auto-adjust prices for splits and dividends
    refresh : bool, default False
        If True, download fresh data even if cache exists

    Returns:
    --------
    pd.DataFrame
        DataFrame with OHLCV data and flattened columns

    Example:
    --------
    >>> df = download_and_cache('MSFT', '2020-01-01', '2024-12-31')
    >>> df = download_and_cache('MSFT', '2020-01-01', '2024-12-31', refresh=True)
    """
    import os
    from datetime import datetime

    # Create cache directory if it doesn't exist
    os.makedirs(cache_dir, exist_ok=True)

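    # Build filename with download date
    # Note: the cache key is ticker + download date only, so a same-day call with a
    # different start, end, or auto_adjust loads the earlier file unless refresh=True
    today = datetime.now().strftime('%Y-%m-%d')
    filename = f"{cache_dir}/{ticker}_{today}.parquet"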

    # Check for cached file (unless refresh is requested)
    if not refresh and os.path.exists(filename):
        print(f"Loading cached data from {filename}")
        return pd.read_parquet(filename)

    # Download fresh data
    print(f"Downloading {ticker}...")
    df = yf.download(ticker, start=start, end=end,
                     auto_adjust=auto_adjust, progress=False)

    # Check if download succeeded
    if df.empty:
        print(f"Warning: No data returned for {ticker}")
        return df

    # Flatten MultiIndex columns
    df = flatten_columns(df)

    # Save to cache
    df.to_parquet(filename)
    print(f"Saved to {filename}")
    print(f"Downloaded {ticker}: {len(df)} rows from {df.index[0].strftime('%Y-%m-%d')} to {df.index[-1].strftime('%Y-%m-%d')}")

    return df

In [49]:
# Test the function
df_test = download_and_cache('MSFT', '2024-01-01', '2024-12-31')
df_test.tail()

# Test the cache - this should load from file, not download
df_test2 = download_and_cache('MSFT', '2024-01-01', '2024-12-31')

# Test the refresh option - this forces a fresh download
df_test3 = download_and_cache('MSFT', '2024-01-01', '2024-12-31', refresh=True)

Downloading MSFT...
Saved to data/MSFT_2026-01-21.parquet
Downloaded MSFT: 251 rows from 2024-01-02 to 2024-12-30
Loading cached data from data/MSFT_2026-01-21.parquet
Downloading MSFT...
Saved to data/MSFT_2026-01-21.parquet
Downloaded MSFT: 251 rows from 2024-01-02 to 2024-12-30


The function handles the complete caching workflow: check for existing data, download if needed, flatten columns, save with a date-stamped filename, and report. The `refresh` parameter gives you control when you need fresh data regardless of what is cached. Compare the three calls in the output above: the first downloads from yfinance, the second loads from the local file, and the third downloads again because `refresh=True`.
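
One limitation is worth knowing about: the filename depends only on the ticker and the download date, so two same-day requests for different date ranges share one cache file. A possible refinement is to fold the request parameters into the filename. The helper below, `build_cache_path`, is hypothetical and not part of the notebook's function; it is a sketch of the idea under that assumption.

```python
import os
from datetime import date
from typing import Optional

def build_cache_path(ticker: str, start: str, end: str,
                     auto_adjust: bool = False,
                     cache_dir: str = 'data',
                     download_date: Optional[date] = None) -> str:
    """Hypothetical helper: build a cache filename that reflects the full request."""
    stamp = (download_date or date.today()).strftime('%Y-%m-%d')
    adj = 'adj' if auto_adjust else 'raw'
    # Replace characters such as '^' (as in '^VIX') that are awkward in filenames
    safe_ticker = ''.join(c if c.isalnum() else '_' for c in ticker)
    filename = f"{safe_ticker}_{start}_{end}_{adj}_{stamp}.parquet"
    return os.path.join(cache_dir, filename)

# Same ticker and download date, different date ranges: two distinct files
fixed_day = date(2026, 1, 21)
path_a = build_cache_path('MSFT', '2024-01-01', '2024-12-31', download_date=fixed_day)
path_b = build_cache_path('MSFT', '2020-01-01', '2024-12-31', download_date=fixed_day)
print(path_a)
print(path_b)
```

The trade-off is longer filenames and more files on disk, in exchange for never loading data that does not match the request.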

**Cleanup**

The test cells above created cache files in the data directory. Let's remove them so the notebook leaves no artifacts behind.

In [52]:
# Clean up
import glob

# Remove test files and any date-stamped cache files
for pattern in ['msft_test.parquet', 'data/*.parquet']:
    for f in glob.glob(pattern):
        os.remove(f)
        print(f"Removed {f}")

# Remove data directory if empty
if os.path.exists('data') and not os.listdir('data'):
    os.rmdir('data')

print("Cleaned up cache files.")

Removed data/MSFT_2026-01-21.parquet
Cleaned up cache files.


---

## Summary

In this notebook, we covered the mechanics of downloading and storing financial data:

1. **API Key Security**: Never hardcode API keys. Use Colab Secrets or environment variables to keep them separate from your code.

2. **yfinance**: Our primary data source for equity and ETF prices. Key parameters include `auto_adjust` (set to `False` to see both `Close` and `Adj Close`), `start`/`end` for date ranges, and `interval` for intraday data. Always apply `flatten_columns()` after downloading single tickers.

3. **Financial Modeling Prep**: A commercial alternative when you need reliability, fundamentals, or support. The free tier is not practical; plan on a paid subscription if you want to use it.

4. **FRED**: Free and reliable source for economic data, but be aware of the traps: frequency mismatch, publication lag, and data revisions can introduce lookahead bias.

5. **Data Storage**: Store data locally for speed and reproducibility. Use parquet as your default format (fast, compressed, version-stable). Use CSV when you need to inspect data manually or share with others. Avoid pickle for long-term storage due to version compatibility issues. Consider HDF5 only for very large datasets that do not fit in memory.

6. **Caching Pattern**: The `download_and_cache()` function checks for existing data before downloading, includes the download date in filenames for version tracking, and offers a `refresh` option when you need fresh data.

In the next notebook, we will download the data for our case study (SPY, BIL, VIXY, ^VIX) and prepare it for analysis.