# Alternative Data Integration for Quantitative Finance

**Author**: Kevin J. Metzler  
**Date**: August 6, 2025

This notebook demonstrates the integration of alternative data sources with traditional market data for enhanced signal generation in quantitative trading strategies. We explore how news sentiment, social media data, and other non-traditional data sources can provide additional alpha when properly incorporated into quantitative models.

## 1. Introduction to Alternative Data in Finance

Alternative data refers to non-traditional data sources that can provide insights into financial markets beyond standard market data. These include:

- **News and Social Media**: Sentiment analysis from news articles, tweets, and forum discussions
- **Satellite Imagery**: Analysis of parking lots, shipping activity, and agricultural fields
- **Web Traffic**: Consumer interest measured through website visits and app usage
- **Credit Card Transactions**: Consumer spending patterns and trends
- **Geolocation Data**: Foot traffic at retail locations and other venues

In this notebook, we'll focus on implementing news sentiment analysis and its integration with traditional market data.

In [None]:
# Import necessary libraries
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from dotenv import load_dotenv

# Add the parent directory to the import path. Notebooks do not define __file__,
# so the parent is resolved from the current working directory.
sys.path.append(os.path.dirname(os.getcwd()))

# Import our alternative data module
from alternative_data import AlternativeDataIntegration, AlternativeDataConfig, NewsSentimentAnalyzer

# Configure plots
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)
plt.rcParams['axes.grid'] = True

# Load environment variables for API keys
load_dotenv()

# Check if API keys are available
news_api_key = os.getenv("NEWS_API_KEY")
if not news_api_key:
    print("WARNING: NEWS_API_KEY not found. Using simulated data.")
    using_simulated_data = True
else:
    using_simulated_data = False

## 2. Fetching Market Data

Let's start by fetching market data for analysis. We'll use the `yfinance` library to obtain price data for selected securities.

In [None]:
import yfinance as yf

# Define the securities we want to analyze
tickers = ['AAPL', 'MSFT', 'GOOGL', 'AMZN', 'NVDA']

# Define time period
start_date = '2023-01-01'
end_date = '2023-12-31'

# Fetch market data
market_data = {}

for ticker in tickers:
    try:
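        data = yf.download(ticker, start=start_date, end=end_date, progress=False)
        
        # A failed lookup can return an empty frame instead of raising
        if data.empty:
            print(f"No data returned for {ticker}; skipping")
            continue
        
        # Flatten MultiIndex columns to plain field names for simpler processing
        if isinstance(data.columns, pd.MultiIndex):
            data.columns = [col[0] for col in data.columns]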
        
        # Calculate returns
        data['return'] = data['Close'].pct_change()
        data['log_return'] = np.log(data['Close'] / data['Close'].shift(1))
        
        # Calculate volatility
        data['volatility_20d'] = data['return'].rolling(20).std() * np.sqrt(252)
        
        market_data[ticker] = data
        print(f"Fetched {len(data)} days of data for {ticker}")
        
    except Exception as e:
        print(f"Error fetching data for {ticker}: {e}")

# Show sample of the data
if market_data:
    sample_ticker = next(iter(market_data))  # first ticker that was fetched successfully
    print(f"\nSample data for {sample_ticker}:")
    print(market_data[sample_ticker].head())
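
The return and volatility columns computed above can be checked on a small example. The sketch below uses made-up prices (not market data) to show that simple returns compound, log returns add, and a daily standard deviation is annualized with `sqrt(252)`, assuming about 252 trading days per year and serially uncorrelated returns. The 3-day window is chosen only because the sample is tiny.

```python
import numpy as np
import pandas as pd

# Synthetic closing prices (illustrative values only)
close = pd.Series(
    [100.0, 102.0, 101.0, 103.0, 104.0],
    index=pd.bdate_range("2023-01-02", periods=5),
    name="Close",
)

simple_ret = close.pct_change()            # P_t / P_{t-1} - 1
log_ret = np.log(close / close.shift(1))   # ln(P_t / P_{t-1})

# Simple returns compound to the total return over the period
total_simple = (1 + simple_ret.dropna()).prod() - 1

# Log returns sum to the log of the total price ratio
total_log = log_ret.sum()

# Rolling daily volatility, annualized with sqrt(252)
window = 3
vol_annualized = simple_ret.rolling(window).std() * np.sqrt(252)

print(f"Total simple return: {total_simple:.4f}")
print(f"Sum of log returns:  {total_log:.4f}")
print(vol_annualized)
```

The first `window` rows of the rolling volatility are `NaN`: one observation is lost to `pct_change`, and the rolling window needs `window` valid returns before it produces a value.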

## 3. News Sentiment Analysis

Now we'll implement news sentiment analysis for our selected securities using the NewsAPI. If the API key is not available, we'll generate simulated sentiment data for demonstration purposes.

In [None]:
# Initialize the alternative data integrator
alt_data = AlternativeDataIntegration(data_dir="../data/alternative")

# Function to create simulated sentiment data if needed
def create_simulated_sentiment_data(ticker, data):
    """
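    Create simulated sentiment data for demonstration purposes when API keys are not available.
    Sentiment is built from 5-day price momentum plus random noise, so any relationship
    between this series and prices is there by construction, not evidence of predictive value.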
    """
    # Calculate price momentum
    price_momentum = data['Close'].pct_change(5).fillna(0)
    
    # Create sentiment with momentum and noise
    np.random.seed(42)  # For reproducibility
    noise = pd.Series(np.random.normal(0, 0.005, len(data)), index=data.index)
    sentiment = price_momentum + noise
    
    # Scale to sentiment range (-1 to 1)
    sentiment = np.tanh(sentiment * 10)
    
    # Create sentiment DataFrame
    sentiment_df = pd.DataFrame({
        'compound': sentiment,
        'positive': (sentiment + 1) / 2,  # Scale to 0-1
        'negative': (1 - sentiment) / 2,  # Scale to 0-1
    }, index=data.index)
    
    # Add rolling metrics
    window = 30  # days
    sentiment_df[f'compound_{window}d_avg'] = sentiment_df['compound'].rolling(window).mean()
    sentiment_df[f'positive_{window}d_avg'] = sentiment_df['positive'].rolling(window).mean()
    sentiment_df[f'negative_{window}d_avg'] = sentiment_df['negative'].rolling(window).mean()
    sentiment_df['sentiment_momentum'] = sentiment_df[f'compound_{window}d_avg'].diff()
    
    # Create topic data (simulated)
    for i in range(5):  # 5 simulated topics
        topic_noise = pd.Series(np.random.normal(0, 0.1, len(data)), index=data.index)
        sentiment_df[f'topic_{i}'] = topic_noise + sentiment_df['compound'] * np.random.uniform(0.5, 1.5)
    
    return sentiment_df

# Process each ticker
enhanced_data = {}

for ticker in tickers:
    if ticker not in market_data:
        continue
        
    print(f"\nProcessing alternative data for {ticker}...")
    
    if using_simulated_data:
        # Use simulated data
        sentiment_data = create_simulated_sentiment_data(ticker, market_data[ticker])
        enhanced_data[ticker] = pd.concat([market_data[ticker], sentiment_data], axis=1)
        print(f"  Created simulated sentiment data for {ticker}")
    else:
        # Use real data from NewsAPI
        enhanced_data[ticker] = alt_data.integrate_all_sources(
            ticker, 
            market_data[ticker], 
            start_date, 
            end_date
        )
        print(f"  Integrated alternative data for {ticker}")

# Show sample of enhanced data
if enhanced_data:
    sample_ticker = next(iter(enhanced_data))  # first ticker that was processed successfully
    print(f"\nSample enhanced data for {sample_ticker}:")
    # Show just a few columns to avoid cluttering the output
    display_cols = ['Close', 'return', 'compound', 'compound_30d_avg', 'sentiment_momentum']
    display_cols = [col for col in display_cols if col in enhanced_data[sample_ticker].columns]
    print(enhanced_data[sample_ticker][display_cols].head())
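
With real news data, article-level scores have to be aligned with trading days before they can be joined to prices. The sketch below is one possible approach, not the method used by `AlternativeDataIntegration`. The `articles` frame, its column names, and its values are hypothetical. Weekend articles are rolled forward to the next business day (exchange holidays are ignored), and the daily series is lagged by one day so that a decision on day *t* uses only news available before day *t*.

```python
import pandas as pd

# Hypothetical article-level compound scores with publication timestamps
articles = pd.DataFrame(
    {
        "published_at": pd.to_datetime(
            [
                "2023-01-03 09:15",
                "2023-01-03 18:40",
                "2023-01-04 11:00",
                "2023-01-07 14:30",  # Saturday
            ]
        ),
        "compound": [0.6, -0.2, 0.4, -0.8],
    }
)

trading_days = pd.bdate_range("2023-01-03", "2023-01-10")

# Map each article to a business day; weekend dates roll forward to Monday
bday = pd.offsets.BDay()
articles["trade_date"] = articles["published_at"].dt.normalize().map(bday.rollforward)

# Average all articles assigned to the same day
daily = articles.groupby("trade_date")["compound"].mean()

# Align with the trading calendar; days without news stay NaN
daily = daily.reindex(trading_days)

# Lag by one trading day to avoid using same-day news in same-day signals
lagged_sentiment = daily.shift(1).rename("compound_lag1")

print(pd.concat([daily.rename("compound"), lagged_sentiment], axis=1))
```

Leaving no-news days as `NaN` keeps the choice of fill method (forward fill, zero, or a decay) explicit instead of hiding it in the alignment step.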

## 4. Visualizing Sentiment and Price Relationships

Let's visualize the relationship between news sentiment and price movements to better understand how sentiment might influence or reflect market behavior.

In [None]:
def plot_sentiment_price_relationship(ticker, data):
    """Plot the relationship between sentiment and price for a given ticker."""
    if 'compound_30d_avg' not in data.columns or 'Close' not in data.columns:
        print(f"Required columns not found for {ticker}")
        return
    
    # Create figure with two y-axes
    fig, ax1 = plt.subplots(figsize=(14, 8))
    ax2 = ax1.twinx()
    
    # Plot price on primary y-axis
    ax1.plot(data.index, data['Close'], 'b-', label=f"{ticker} Price")
    ax1.set_xlabel('Date')
    ax1.set_ylabel('Price ($)', color='b')
    ax1.tick_params(axis='y', labelcolor='b')
    
    # Plot sentiment on secondary y-axis
    ax2.plot(data.index, data['compound_30d_avg'], 'g-', label='30-day Sentiment')
    ax2.set_ylabel('Sentiment Score', color='g')
    ax2.tick_params(axis='y', labelcolor='g')
    ax2.axhline(y=0, color='r', linestyle='--', alpha=0.3)
    
    # Add a title with correlation coefficient
    corr = data[['Close', 'compound_30d_avg']].corr().iloc[0, 1]
    plt.title(f"{ticker}: Price vs. News Sentiment (Correlation: {corr:.3f})")
    
    # Add legend
    lines1, labels1 = ax1.get_legend_handles_labels()
    lines2, labels2 = ax2.get_legend_handles_labels()
    ax1.legend(lines1 + lines2, labels1 + labels2, loc='upper left')
    
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
    
    # Return the correlation for further analysis
    return corr

# Plot for each ticker
correlations = {}
for ticker in enhanced_data:
    print(f"\nAnalyzing sentiment-price relationship for {ticker}")
    corr = plot_sentiment_price_relationship(ticker, enhanced_data[ticker])
    # The plot function returns None when required columns are missing
    if corr is not None:
        correlations[ticker] = corr

# Compare correlations across tickers
if correlations:
    corr_df = pd.DataFrame(list(correlations.items()), columns=['Ticker', 'Correlation'])
    corr_df = corr_df.sort_values('Correlation', ascending=False).reset_index(drop=True)
    
    plt.figure(figsize=(10, 6))
    sns.barplot(x='Ticker', y='Correlation', data=corr_df)
    plt.title('Sentiment-Price Correlation by Ticker')
    plt.axhline(y=0, color='r', linestyle='--', alpha=0.3)
    plt.tight_layout()
    plt.show()
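
A same-day correlation between price levels and smoothed sentiment cannot tell us whether sentiment influences or merely reflects prices. One way to probe that is to correlate returns with sentiment shifted by several lags. The sketch below uses synthetic series constructed so that sentiment leads returns by one day. The helper name `lead_lag_correlation` and all data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
idx = pd.bdate_range("2022-01-03", periods=n)

# Synthetic data: today's return depends on yesterday's sentiment plus noise
sentiment = pd.Series(rng.normal(0, 1, n), index=idx)
returns = 0.5 * sentiment.shift(1) + pd.Series(rng.normal(0, 1, n), index=idx)


def lead_lag_correlation(signal, target, max_lag=3):
    """Correlate `signal` shifted by k days with `target` for k in [-max_lag, max_lag].

    A positive k pairs the signal from k days earlier with today's target,
    so a peak at k > 0 suggests the signal leads the target.
    """
    return pd.Series(
        {k: signal.shift(k).corr(target) for k in range(-max_lag, max_lag + 1)},
        name="correlation",
    )


lag_corr = lead_lag_correlation(sentiment, returns)
best_lag = lag_corr.abs().idxmax()

print(lag_corr.round(3))
print(f"Strongest relationship at lag {best_lag}")
```

On real data, a peak at a negative lag would suggest that sentiment follows price moves rather than anticipating them.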

## 5. Creating a Sentiment-Enhanced Trading Signal

Now, let's develop a simple trading strategy that incorporates sentiment data along with traditional price signals.
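
The signal rule used in this section doubles a momentum signal when sentiment agrees with it, cancels it when sentiment contradicts it, and leaves it unchanged when sentiment is inside the neutral band. The sketch below restates that rule as a small standalone function and enumerates the cases. The name `combine_signals` is hypothetical and the inputs are made up.

```python
import numpy as np
import pandas as pd


def combine_signals(base_signal, sentiment, sentiment_threshold=0.1):
    """Scale a -1/0/+1 momentum signal by agreement with sentiment."""
    base = base_signal.to_numpy()
    sent = sentiment.to_numpy()

    # Sentiment points the same way as momentum, beyond the threshold
    agrees = ((base > 0) & (sent > sentiment_threshold)) | (
        (base < 0) & (sent < -sentiment_threshold)
    )
    # Sentiment points the opposite way, beyond the threshold
    contradicts = ((base > 0) & (sent < -sentiment_threshold)) | (
        (base < 0) & (sent > sentiment_threshold)
    )

    enhanced = np.where(agrees, 2 * base, np.where(contradicts, 0, base))
    return pd.Series(enhanced, index=base_signal.index, name="enhanced_signal")


cases = pd.DataFrame(
    {
        "base_signal": [1, 1, 1, -1, -1, -1, 0],
        "sentiment": [0.3, 0.0, -0.3, -0.3, 0.0, 0.3, 0.5],
    }
)
cases["enhanced_signal"] = combine_signals(cases["base_signal"], cases["sentiment"])
print(cases)
```

Note that sentiment alone never opens a position: when the momentum signal is 0, the enhanced signal stays 0 regardless of sentiment.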

In [None]:
def generate_sentiment_enhanced_signals(data, sentiment_threshold=0.1, momentum_lookback=20):
    """Generate trading signals based on price momentum and sentiment."""
    signals = pd.DataFrame(index=data.index)
    
    # Calculate price momentum
    signals['price_momentum'] = data['Close'].pct_change(momentum_lookback)
    
    # Get sentiment signals
    if 'compound_30d_avg' in data.columns:
        signals['sentiment'] = data['compound_30d_avg']
    else:
        signals['sentiment'] = 0  # Neutral if no sentiment data
    
    # Generate base signals (price momentum only)
    signals['base_signal'] = 0
    signals.loc[signals['price_momentum'] > 0.05, 'base_signal'] = 1  # Long
    signals.loc[signals['price_momentum'] < -0.05, 'base_signal'] = -1  # Short
    
    # Generate enhanced signals (price momentum + sentiment)
    signals['enhanced_signal'] = signals['base_signal'].copy()
    
    # Amplify signals when sentiment agrees with momentum
    signals.loc[(signals['base_signal'] > 0) & (signals['sentiment'] > sentiment_threshold), 'enhanced_signal'] = 2
    signals.loc[(signals['base_signal'] < 0) & (signals['sentiment'] < -sentiment_threshold), 'enhanced_signal'] = -2
    
    # Reduce signals when sentiment contradicts momentum
    signals.loc[(signals['base_signal'] > 0) & (signals['sentiment'] < -sentiment_threshold), 'enhanced_signal'] = 0
    signals.loc[(signals['base_signal'] < 0) & (signals['sentiment'] > sentiment_threshold), 'enhanced_signal'] = 0
    
    return signals

def backtest_strategy(data, signals, initial_capital=10000, transaction_cost=0.001):
    """Simple backtesting of a trading strategy."""
    # Initialize portfolio and positions (float columns so fractional values fit)
    portfolio = pd.DataFrame(index=signals.index)
    portfolio['holdings'] = 0.0
    portfolio['cash'] = float(initial_capital)
    portfolio['total'] = float(initial_capital)
    portfolio['returns'] = 0.0
    portfolio['signal'] = signals
    portfolio['position'] = 0.0
    
    # Column positions for direct .iloc writes. Chained assignment such as
    # portfolio['cash'].iloc[i] = x may write to a temporary copy and has
    # no effect under pandas copy-on-write.
    col = {name: portfolio.columns.get_loc(name) for name in portfolio.columns}
    
    # Position sizing - simple implementation
    position_size = initial_capital * 0.95  # Use 95% of capital
    
    for i in range(1, len(portfolio)):
        # Get previous state and prices
        prev_position = portfolio.iloc[i-1, col['position']]
        prev_holdings = portfolio.iloc[i-1, col['holdings']]
        cash = portfolio.iloc[i-1, col['cash']]
        price = data['Close'].iloc[i]
        prev_price = data['Close'].iloc[i-1]
        
        # Mark existing holdings to market at today's close. This covers long
        # and short positions because short holdings are stored as negative values.
        holdings = prev_holdings * price / prev_price
        
        # Target position is the current signal
        new_position = signals.iloc[i]
        
        # If position changes, execute trade
        if new_position != prev_position:
            # Close out previous position at today's price, paying costs on the
            # traded notional whether selling a long or buying back a short
            if prev_position != 0:
                cash += holdings - abs(holdings) * transaction_cost
                holdings = 0.0
            
            # Open new position if applicable, sized by signal strength (1 or 2)
            if new_position != 0:
                trade_value = position_size * abs(new_position) / 2
                if new_position > 0:  # Long: pay notional plus costs
                    cash -= trade_value * (1 + transaction_cost)
                    holdings = trade_value
                else:  # Short: receive notional minus costs
                    cash += trade_value * (1 - transaction_cost)
                    holdings = -trade_value
        
        # Record today's state
        portfolio.iloc[i, col['position']] = new_position
        portfolio.iloc[i, col['holdings']] = holdings
        portfolio.iloc[i, col['cash']] = cash
        total = cash + holdings
        portfolio.iloc[i, col['total']] = total
        
        # Calculate returns
        portfolio.iloc[i, col['returns']] = total / portfolio.iloc[i-1, col['total']] - 1
    
    # Calculate performance metrics
    portfolio['cumulative_returns'] = (1 + portfolio['returns']).cumprod() - 1
    annual_return = portfolio['returns'].mean() * 252
    annual_vol = portfolio['returns'].std() * np.sqrt(252)
    sharpe_ratio = annual_return / annual_vol if annual_vol > 0 else 0
    max_drawdown = (portfolio['total'] / portfolio['total'].cummax() - 1).min()
    
    # Store metrics
    metrics = {
        'annual_return': annual_return,
        'annual_volatility': annual_vol,
        'sharpe_ratio': sharpe_ratio,
        'max_drawdown': max_drawdown,
        'final_value': portfolio['total'].iloc[-1],
        'total_return': portfolio['total'].iloc[-1] / initial_capital - 1
    }
    
    return portfolio, metrics

# Run backtest for each ticker
backtest_results = {}

for ticker in enhanced_data:
    print(f"\nBacktesting strategies for {ticker}...")
    
    data = enhanced_data[ticker]
    
    # Generate signals once and reuse both columns
    signals = generate_sentiment_enhanced_signals(data)
    base_signals = signals['base_signal']
    enhanced_signals = signals['enhanced_signal']
    
    # Run backtests
    base_portfolio, base_metrics = backtest_strategy(data, base_signals)
    enhanced_portfolio, enhanced_metrics = backtest_strategy(data, enhanced_signals)
    
    # Store results
    backtest_results[ticker] = {
        'base': {
            'portfolio': base_portfolio,
            'metrics': base_metrics
        },
        'enhanced': {
            'portfolio': enhanced_portfolio,
            'metrics': enhanced_metrics
        }
    }
    
    # Print performance comparison
    print(f"\n{ticker} Performance Comparison:")
    print(f"  Base Strategy:")
    print(f"    Annual Return: {base_metrics['annual_return']:.2%}")
    print(f"    Sharpe Ratio: {base_metrics['sharpe_ratio']:.2f}")
    print(f"    Max Drawdown: {base_metrics['max_drawdown']:.2%}")
    print(f"    Total Return: {base_metrics['total_return']:.2%}")
    print(f"\n  Enhanced Strategy (with Sentiment):")
    print(f"    Annual Return: {enhanced_metrics['annual_return']:.2%}")
    print(f"    Sharpe Ratio: {enhanced_metrics['sharpe_ratio']:.2f}")
    print(f"    Max Drawdown: {enhanced_metrics['max_drawdown']:.2%}")
    print(f"    Total Return: {enhanced_metrics['total_return']:.2%}")
    
    # Plot equity curves
    plt.figure(figsize=(14, 7))
    plt.plot(base_portfolio.index, base_portfolio['total'], label='Base Strategy')
    plt.plot(enhanced_portfolio.index, enhanced_portfolio['total'], label='Enhanced Strategy')
    plt.title(f"{ticker}: Equity Curve Comparison")
    plt.xlabel('Date')
    plt.ylabel('Portfolio Value ($)')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
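
The performance metrics reported above can be verified by hand on a short equity curve. The sketch below recomputes them in a standalone helper. The function name `performance_metrics` and the equity values are made up for illustration, and the Sharpe ratio assumes a zero risk-free rate, as in the backtest.

```python
import numpy as np
import pandas as pd


def performance_metrics(equity, periods_per_year=252):
    """Summarize an equity curve. The risk-free rate is assumed to be zero."""
    returns = equity.pct_change().dropna()

    # Arithmetic annualization of mean and standard deviation
    annual_return = returns.mean() * periods_per_year
    annual_vol = returns.std() * np.sqrt(periods_per_year)
    sharpe = annual_return / annual_vol if annual_vol > 0 else 0.0

    # Drawdown is the decline from the running peak
    drawdown = equity / equity.cummax() - 1

    return {
        "annual_return": annual_return,
        "annual_volatility": annual_vol,
        "sharpe_ratio": sharpe,
        "max_drawdown": drawdown.min(),
        "total_return": equity.iloc[-1] / equity.iloc[0] - 1,
    }


equity = pd.Series(
    [10000.0, 10500.0, 9975.0, 10200.0, 9450.0, 10100.0],
    index=pd.bdate_range("2023-01-02", periods=6),
)

metrics = performance_metrics(equity)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here the running peak is 10,500 and the trough is 9,450, so the maximum drawdown is 9,450 / 10,500 - 1 = -10%, while the total return is only +1%. Annualized figures from six observations are not meaningful; they are computed only to show the formulas.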

## 6. Feature Importance Analysis

Finally, let's analyze the importance of alternative data features in predicting returns.

In [None]:
# Analyze feature importance
feature_importance = {}

for ticker in enhanced_data:
    print(f"\nAnalyzing feature importance for {ticker}...")
    
    data = enhanced_data[ticker].copy()
    
    # Calculate next day returns as target
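    data['next_return'] = data['return'].shift(-1)
    # The last row has no next-day return, so drop it before fitting
    data = data.dropna(subset=['next_return'])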
    
    # Get feature importance
    try:
        importance = alt_data.create_feature_importance_analysis(data, target_col='next_return')
        feature_importance[ticker] = importance
        
        # Plot top features
        top_features = importance['top_features']
        top_importances = [importance['importance'][importance['feature_names'].index(f)] for f in top_features]
        
        plt.figure(figsize=(12, 6))
        y_pos = range(len(top_features))
        plt.barh(y_pos, top_importances, align='center')
        plt.yticks(y_pos, top_features)
        plt.xlabel('Importance')
        plt.title(f"{ticker}: Feature Importance for Return Prediction")
        plt.tight_layout()
        plt.show()
        
        print(f"  R² Score: {importance['r2_score']:.4f}")
        print(f"  Top 5 features: {', '.join(top_features[:5])}")
        
    except Exception as e:
        print(f"  Error analyzing feature importance: {e}")
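
The internals of `create_feature_importance_analysis` are not shown in this notebook. As a rough illustration of what such an analysis can look like, the sketch below computes permutation importance for a linear model using only NumPy. It is an assumption-laden stand-in, not the module's implementation. The feature names and data are synthetic, and only `sentiment_lag1` is constructed to carry information about the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 600

# Synthetic features; only sentiment_lag1 is related to the target
features = pd.DataFrame(
    {
        "sentiment_lag1": rng.normal(0, 1, n),
        "momentum_20d": rng.normal(0, 1, n),
        "volatility_20d": rng.normal(0, 1, n),
    }
)
target = 0.4 * features["sentiment_lag1"] + rng.normal(0, 1, n)

# Chronological split: fit on the first 70%, evaluate on the rest (no shuffling)
split = int(n * 0.7)
X_train, X_test = features.iloc[:split].to_numpy(), features.iloc[split:].to_numpy()
y_train, y_test = target.iloc[:split].to_numpy(), target.iloc[split:].to_numpy()


def fit_ols(X, y):
    """Least-squares coefficients with an intercept in position 0."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


def r2(coef, X, y):
    """Coefficient of determination for a fitted linear model."""
    pred = np.column_stack([np.ones(len(X)), X]) @ coef
    return 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)


coef = fit_ols(X_train, y_train)
baseline = r2(coef, X_test, y_test)

# Importance = average drop in out-of-sample R² when one column is shuffled
importance = {}
for j, name in enumerate(features.columns):
    drops = []
    for _ in range(20):
        X_perm = X_test.copy()
        X_perm[:, j] = rng.permutation(X_perm[:, j])
        drops.append(baseline - r2(coef, X_perm, y_test))
    importance[name] = float(np.mean(drops))

ranking = pd.Series(importance).sort_values(ascending=False)
print(f"Out-of-sample R²: {baseline:.4f}")
print(ranking)
```

Splitting by time rather than at random matters for return prediction: a shuffled split lets the model see observations that postdate the test rows.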

## 7. Conclusion

In this notebook, we've explored the integration of alternative data sources with traditional market data for enhanced signal generation in quantitative trading strategies. The key findings include:

1. **Sentiment Impact**: News sentiment can provide additional signals that complement traditional price-based indicators.

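2. **Enhanced Performance**: The backtest compares a momentum-only strategy with a sentiment-enhanced variant on annualized return, Sharpe ratio, maximum drawdown, and total return. Any improvement should be confirmed with real sentiment data, because the simulated series is derived from price momentum and is related to returns by construction.

3. **Feature Importance**: The feature importance analysis shows which alternative data features rank among the top predictors of next-day returns for each ticker, indicating where they may add value.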

### Further Research Opportunities

- Integration of additional alternative data sources such as satellite imagery and web traffic
- More sophisticated NLP techniques for sentiment analysis, including topic modeling and entity extraction
- Exploration of non-linear relationships between sentiment and returns using advanced machine learning models
- Investigation of sentiment lead/lag relationships with different asset classes and market regimes