# 1. Data Collection #
The market-aux.py script is responsible for collecting financial news data from the MarketAux platform. This platform aggregates news from various sources, providing a rich dataset that includes up-to-date financial news headlines.

## Historical Market Data Collection ##
Identify Relevant Data: Determine which financial instruments (e.g., stocks, futures, forex) are relevant to your strategy. For the E-mini S&P 500 futures contract example provided, you'll collect price and volume data.

Data Granularity: Decide on the time frame for the data (e.g., 1-minute, 1-hour, daily). Higher-frequency data allows for more granular analysis but requires more storage and processing power.

Historical Depth: Determine how far back you need historical data. Longer historical periods can provide more robust training but may not always reflect current market conditions.

Data Sources: Use the TWS API for market data.

## News Headlines Collection ##
Relevance: Focus on news that directly impacts your financial instruments. For the S&P 500, this might include economic indicators, earnings reports, and major geopolitical events.

Sentiment Analysis: Consider performing sentiment analysis on news headlines to quantify the market sentiment. This can be a feature for your model.

API Integration: Use APIs provided by news sources or financial data providers. Make sure to handle rate limits and request quotas.
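
One way to respect rate limits is to retry with an exponentially growing pause whenever the server answers with HTTP 429 (Too Many Requests). The sketch below is illustrative only: `fetch_with_backoff` and the `fetch` callable it receives are hypothetical names, and the retry count and delays are assumptions to adjust to the quota of your plan.

```python
import time

def fetch_with_backoff(fetch, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it stops reporting HTTP 429, doubling the wait each time.

    `fetch` is a hypothetical zero-argument callable returning (status_code, payload).
    `sleep` is injectable so the function can be exercised without real waiting.
    """
    for attempt in range(max_retries + 1):
        status, payload = fetch()
        if status != 429:
            return status, payload
        if attempt < max_retries:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt
            sleep(base_delay * 2 ** attempt)
    # Still rate limited after the final attempt: hand the last response back
    return status, payload
```

With `requests`, the callable could be as simple as a function that performs the GET and returns `(response.status_code, response.json())`.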

## Methodology ##
API Utilization: The script makes use of the MarketAux API to fetch news articles. The API requests are crafted to include parameters such as symbols (e.g., ^GSPC for the S&P 500), language preferences, and entity filtering to ensure that only relevant news articles are retrieved.

Data Scraping and Parsing: Using Python's requests and BeautifulSoup libraries, the script collects news headlines, publication dates, and associated metadata. The MarketAux endpoint itself returns JSON, so the excerpt shown in this section needs only requests. This information provides the context needed to relate each news item to market movements.

## Design Choices and Assumptions ##
Scraping Frequency: It is assumed that the data needs to be updated regularly; therefore, the script is designed to be run as often as necessary without overloading the API (compliance with API rate limits).

Data Relevance: The script filters for specific symbols to ensure data relevance to the trading strategies that will be employed.

In [4]:
from ib_insync import IB, Future, util, Order

ib = IB()
ib.connect('127.0.0.1', 7497, clientId=2)

<IB connected to 127.0.0.1:7497 clientId=2>

In [18]:
# Update the contract details as needed
contract = Future(symbol='ES', lastTradeDateOrContractMonth='202403', exchange='CME')

# Attempt to qualify the contract
try:
    qualified_contract = ib.qualifyContracts(contract)[0]
    print("Contract qualified successfully:", qualified_contract)
except Exception as e:
    print(f"Error qualifying contract: {e}")
    qualified_contract = None

if qualified_contract:
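    # Request historical data for the qualified contract.
    # Note: MIDPOINT bars carry no trade volume (volume, average and barCount
    # come back as -1); use whatToShow='TRADES' if volume is needed.
    historical_data = ib.reqHistoricalData(
        qualified_contract, endDateTime='', durationStr='30 D',
        barSizeSetting='1 hour', whatToShow='MIDPOINT', useRTH=True)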

    if historical_data:
        df = util.df(historical_data)
        print(df.head())
    else:
        print("No historical data returned.")
else:
    print("Unable to request historical data due to contract issues.")

Contract qualified successfully: Future(conId=533620665, symbol='ES', lastTradeDateOrContractMonth='20240315', multiplier='50', exchange='CME', currency='USD', localSymbol='ESH4', tradingClass='ES')
                       date     open     high      low    close  volume  \
0 2024-01-30 08:30:00-06:00  4946.25  4953.75  4946.00  4953.50    -1.0   
1 2024-01-30 09:00:00-06:00  4953.50  4954.00  4944.00  4947.25    -1.0   
2 2024-01-30 10:00:00-06:00  4947.25  4955.75  4945.25  4954.25    -1.0   
3 2024-01-30 11:00:00-06:00  4954.25  4955.75  4941.50  4944.75    -1.0   
4 2024-01-30 12:00:00-06:00  4944.75  4951.25  4942.00  4948.25    -1.0   

   average  barCount  
0     -1.0        -1  
1     -1.0        -1  
2     -1.0        -1  
3     -1.0        -1  
4     -1.0        -1  


In [19]:
ib.reqMarketDataType(3)  # Switch to delayed data if necessary

historical_data = ib.reqHistoricalData(
    contract, endDateTime='', durationStr='30 D',
    barSizeSetting='1 hour', whatToShow='MIDPOINT', useRTH=True)

df = util.df(historical_data)
print(df.head())

                       date     open     high      low    close  volume  \
0 2024-01-30 08:30:00-06:00  4946.25  4953.75  4946.00  4953.50    -1.0   
1 2024-01-30 09:00:00-06:00  4953.50  4954.00  4944.00  4947.25    -1.0   
2 2024-01-30 10:00:00-06:00  4947.25  4955.75  4945.25  4954.25    -1.0   
3 2024-01-30 11:00:00-06:00  4954.25  4955.75  4941.50  4944.75    -1.0   
4 2024-01-30 12:00:00-06:00  4944.75  4951.25  4942.00  4948.25    -1.0   

   average  barCount  
0     -1.0        -1  
1     -1.0        -1  
2     -1.0        -1  
3     -1.0        -1  
4     -1.0        -1  


In [29]:
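import requests
from dateutil.parser import parse

def get_headlines(api_token, pages=3):
    url = "https://api.marketaux.com/v1/news/all"

    cleaned_data = []
    # Request a different page on each pass; fetching the same URL three times
    # only returns the same articles again.
    for page in range(1, pages + 1):
        params = {
            'symbols': '^GSPC',
            'filter_entities': 'true',
            'language': 'en',
            'page': page,  # assumes the endpoint's `page` query parameter
            'api_token': api_token,
        }
        response = requests.get(url, params=params, timeout=10)
        response.raise_for_status()  # fail loudly on HTTP errors such as rate limiting
        news_headlines = response.json()
        for article in news_headlines.get('data', []):
            date = article['published_at']
            title = article['title']
            cleaned_data.append({'date': date, 'headline': title})

    return cleaned_data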

def extract_headlines(data):
    headlines = {}  # This will store dates with a list of headlines
    
    # 'data' is a list of dictionaries, each with 'date' and 'headline' keys
    for item in data:
        # Parse the publication date of the article and format it as a string
        date = parse(item['date']).date()
        date_str = date.strftime('%Y-%m-%d')
        
        # Ensure there is a list to append headlines for the given date
        if date_str not in headlines:
            headlines[date_str] = []
        
        # Append the headline to the list for the given date
        headlines[date_str].append(item['headline'])
    
    return headlines

In [30]:
api_token = 'YOUR_API_TOKEN'  # Replace with your actual API token; do not commit real credentials
headlines_data = get_headlines(api_token)
organized_headlines = extract_headlines(headlines_data)

# To print the organized headlines:
for date, titles in organized_headlines.items():
    print(f"Date: {date}")
    for title in titles:
        print(f" - {title}")

Date: 2024-03-12
 - S&P 500 Gains and Losses Today: Oracle Surges as Results Show AI Advancement
 - U.S. Stock Market Optimism Amid Inflation Concerns
 - Market Talk – March 12, 2024
 - S&P 500 Gains and Losses Today: Oracle Surges as Results Show AI Advancement
 - U.S. Stock Market Optimism Amid Inflation Concerns
 - Market Talk – March 12, 2024
 - S&P 500 Gains and Losses Today: Oracle Surges as Results Show AI Advancement
 - U.S. Stock Market Optimism Amid Inflation Concerns
 - Market Talk – March 12, 2024
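
The recorded output above lists each headline three times, which is what happens when the same result page is fetched repeatedly. Independently of how the pages are requested, a small de-duplication pass keeps repeated articles from being counted more than once downstream. A minimal sketch, assuming the list-of-dicts shape returned by `get_headlines` (`dedupe_headlines` is a hypothetical helper, not part of the original script):

```python
def dedupe_headlines(items):
    """Drop repeated (date, headline) pairs while keeping first-seen order."""
    seen = set()
    unique = []
    for item in items:
        key = (item['date'], item['headline'])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique
```

It could be applied as `extract_headlines(dedupe_headlines(headlines_data))`.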




# 2. Feature Extraction & Preprocessing #

The script sentiment-analysis-data.py handles the preprocessing and feature extraction needed to prepare the scraped data for model training.

## Text Preprocessing
**HTML Tag Removal**: Financial news data scraped from the web often contains HTML markup. Removing HTML tags is crucial for obtaining clean text that contains only the content of interest. This is achieved using regular expressions that identify and strip out HTML tags.

**Lowercasing**: Converting all text to lowercase ensures uniformity across the dataset. This is important because many text processing tools treat words differently based on case (e.g., "Bank" vs. "bank"). Lowercasing helps in reducing the feature space and improves the robustness of the model.

**Tokenization**: This process splits text into individual words or phrases. It is a fundamental step in text analysis, as it turns a string of characters into tokens which can be individually analyzed. Word tokenization is performed using libraries like NLTK, which can handle complex word separation rules.

**Stopword Removal**: Stopwords are common words (such as "the", "is", and "in") that appear frequently across texts but carry little meaningful information for analysis. Removing these words helps in focusing on words that have more predictive power for the sentiment analysis.

**Special Character Removal**: Non-alphanumeric characters and punctuation are removed because they are generally not useful in understanding the sentiment of a text. This also helps in reducing the number of unique tokens the model needs to handle.

**Stemming and Lemmatization**: These techniques reduce words to their root or base form. For instance, "stocks", "stocker", and "stocking" might all be reduced to "stock". This helps in aggregating the sentiment expressed towards similar concepts. Stemming is faster but cruder as it chops off word ends, while lemmatization is more sophisticated, using vocabulary and morphological analysis.
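
The steps above can be sketched with the standard library alone. This is a simplified stand-in rather than the project's actual pipeline: it uses whitespace tokenization and a tiny hand-written stopword list instead of NLTK, and it leaves out stemming and lemmatization. `STOPWORDS` and `preprocess` are illustrative names.

```python
import re

# Tiny illustrative stopword list; NLTK's English list is far longer.
STOPWORDS = {"the", "is", "in", "a", "an", "and", "as", "of", "to"}

def preprocess(text):
    """Clean a headline and return its remaining tokens."""
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = text.lower()                        # lowercase for uniformity
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # drop punctuation and special characters
    tokens = text.split()                      # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("<b>Oracle</b> Surges as Results Show AI Advancement!"))
```

Note that stripping special characters splits tickers and names such as "S&P" into separate tokens, so a production pipeline may want to protect those before cleaning.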

## Feature Extraction
**Word Embeddings (Word2Vec)**: Instead of using traditional count-based methods like Bag-of-Words (BoW) or TF-IDF, which result in sparse and high-dimensional vectors, Word2Vec creates dense vectors for each word. These vectors capture semantic meanings, where similar words have similar encoding. Word2Vec uses neural networks to learn word associations from a large corpus of text. In this project, a Word2Vec model is trained on the preprocessed news headlines to capture the context within the financial news domain.

**Vector Averaging for Sentences**: After converting words into vectors, sentence-level features are created by averaging the word vectors of each sentence (or document). Averaging retains the overall semantic content of the sentence, although it discards word order, and yields one fixed-length vector per headline regardless of headline length, which keeps subsequent modeling steps efficient.
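
As a rough illustration of sentence-level averaging, the sketch below uses a plain dictionary of toy two-dimensional vectors in place of a trained Word2Vec model. The `embeddings` mapping and the `average_vectors` helper are assumptions made for the example.

```python
def average_vectors(tokens, embeddings, dim):
    """Average the vectors of known tokens; return a zero vector if none are known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the values of each dimension together
    return [sum(column) / len(vectors) for column in zip(*vectors)]

# Toy embeddings, invented for illustration only
embeddings = {"oracle": [1.0, 0.0], "surges": [3.0, 2.0]}
print(average_vectors(["oracle", "surges", "unknown"], embeddings, dim=2))
```

Tokens missing from the vocabulary are skipped, so a headline made up entirely of unseen words maps to the zero vector.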

**Dimensionality Reduction and Normalization**: If necessary, dimensionality can be reduced further with techniques such as PCA (Principal Component Analysis). Features are also normalized so that they share a comparable scale, which keeps any single dimension from dominating the gradient updates during training.
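
One common form of normalization is per-feature standardization (zero mean, unit variance). The source does not state which scheme the script uses, so the sketch below is only an assumption; `standardize_columns` is a hypothetical helper.

```python
from statistics import mean, pstdev

def standardize_columns(rows):
    """Rescale each feature column to zero mean and unit (population) variance."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant column has zero spread; divide by 1.0 instead to avoid a ZeroDivisionError
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

To avoid leaking information, the means and standard deviations should be computed on the training set only and then reused for validation and test data.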


## Algorithms/Strategies Used
Word2Vec for Semantic Analysis: This method is chosen for its ability to capture contextual relationships between words, making it suitable for sentiment analysis.

## Design Choices and Assumptions
Vectorization: The choice to use Word2Vec over other methods like TF-IDF or direct count vectors is based on the need for dense representations that support nuanced semantic analysis.

# 3. Model Development #
The sentiment_model.py script develops a neural network to perform sentiment analysis based on the features extracted from the news headlines.

## Neural Network Architecture
A convolutional neural network (CNN) is used for its efficacy in handling sequential input data (text).

**Model Selection**: The model chosen for this task is a Convolutional Neural Network (CNN). While CNNs are traditionally known for image processing tasks, they are also highly effective for sequence processing of text data due to their ability to extract higher-order features through convolutional layers.

**Input Layer**: The input to the model consists of fixed-size vectors obtained from the Word2Vec model. These vectors undergo preprocessing to ensure they are suitable for neural network processing, including normalization.

**Convolutional Layers**: The model uses 1D convolutional layers which are ideal for extracting features from sequences. In the context of text, these layers slide over word embeddings, capturing local dependencies and semantic patterns.

**Pooling Layers**: Following convolutional layers, max pooling layers are used to reduce the dimensionality of the data, which helps in reducing the computational load and also in extracting the most dominant features which are robust against the noise in the data.
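
To make the convolution and pooling steps concrete, here is a framework-free sketch of a single 1D filter sliding over a sequence of word embeddings, followed by max pooling. It is illustrative only: a real model would use a deep learning library with many learned filters, and the kernel weights below are made up.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide one filter over a sequence of embedding vectors (no padding, stride 1)."""
    width = len(kernel)
    outputs = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = bias
        for vector, weights in zip(window, kernel):
            total += sum(x * w for x, w in zip(vector, weights))
        outputs.append(max(0.0, total))  # ReLU activation
    return outputs

def max_pool1d(values, size):
    """Keep the largest activation in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# Four "words", each a 2-dimensional embedding (toy values)
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]  # filter spanning two consecutive words
features = conv1d(sequence, kernel)
print(features, max_pool1d(features, size=2))
```

Each convolution output summarizes two neighboring words, and pooling keeps only the strongest response in each window.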

**Dropout Layers**: Dropout is implemented as a regularization method to prevent overfitting. By randomly dropping units in the neural network during training, it forces the network to learn robust features that are useful in conjunction with many different random subsets of the other neurons.

**Fully Connected Layers**: The output from the convolutional and pooling layers is flattened and fed into fully connected (dense) layers. These layers synthesize the features extracted by the convolutions to make a final prediction regarding the sentiment.

**Output Layer**: The output layer consists of neurons equal to the number of sentiment classes (positive, neutral, negative). It typically uses a softmax activation function to output a probability distribution over the classes.

**Training Process**: The model is trained using cross-entropy loss, with Adam optimizer and a learning rate scheduler to improve training dynamics.

## Training and Validation
**Loss Function**: Cross-entropy loss is used, which is standard for classification tasks. It compares the predicted probability distribution with the true class and grows as the probability assigned to the correct class shrinks.
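
For a single example, the softmax output and the cross-entropy loss can be written out by hand. The following sketch assumes three classes ordered as positive, neutral, negative; the ordering and the function names are illustrative.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probs[true_index])

probs = softmax([2.0, 0.5, -1.0])
print(probs, cross_entropy(probs, true_index=0))
```

The loss is close to zero when the correct class receives nearly all of the probability mass and increases sharply as that probability falls.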

**Optimizer**: The Adam optimizer is chosen for its effectiveness in handling sparse gradients and its adaptiveness in updating weights, which is beneficial for quickly converging in training deep neural networks.

**Batch Processing and Iterations**: The model is trained using mini-batch gradient descent, which sits between the cheap but noisy updates of stochastic gradient descent and the stable but expensive updates of full-batch gradient descent. The number of epochs and the batch size are tuned based on performance on a validation set.

**Learning Rate Scheduling**: A learning rate scheduler adjusts the learning rate during training, typically lowering it over time so that later updates can settle into a better minimum.
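
The source does not say which schedule is used. As one common possibility, step decay multiplies the learning rate by a fixed factor every few epochs. The function name and default values below are assumptions.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate (by default) after every `epochs_per_drop` epochs."""
    return initial_lr * drop ** (epoch // epochs_per_drop)

for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(0.001, epoch))
```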

**Validation Strategy**: The dataset is split into training and validation sets. The validation set is used to tune the hyperparameters and to prevent overfitting. Early stopping is also implemented to stop training when the validation loss ceases to decrease, thereby saving computational resources and preventing overfitting.
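
Early stopping can be expressed as a small bookkeeping object that tracks the best validation loss seen so far. This is a minimal sketch; the `EarlyStopping` class, its `patience` value, and the example losses are illustrative and not taken from the original script.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best_loss - self.min_delta:
            # Improvement: remember it and reset the counter
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
for epoch, loss in enumerate([1.0, 0.8, 0.9, 0.85, 0.7]):
    if stopper.should_stop(loss):
        print(f"Stopping after epoch {epoch}; best validation loss {stopper.best_loss}")
        break
```

In practice the model weights from the best epoch are usually saved and restored when training stops.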

## Algorithms/Strategies Used
CNN for Text Classification: Chosen for its ability to extract higher-order features from text data, which is crucial in understanding the sentiment expressed in news headlines.

Feature Representation: It is assumed that the Word2Vec embeddings provide a rich enough representation of text to capture semantic meanings necessary for sentiment analysis.

Generalization Capability: The architecture is designed to generalize well to unseen data by employing techniques such as dropout and early stopping based on the validation set performance.

## Design Choices and Assumptions
Data Splitting: The data is split into training and testing sets to evaluate the model's performance on unseen data, ensuring that it generalizes well.

# 4. Strategy Simulation #
Simulation of trading strategies based on the output of the sentiment analysis model is conducted to test the potential efficacy of trading decisions influenced by news sentiment.

## Design and Implementation of Trading Strategies
**Strategy Design Based on Sentiment Scores**: The core of the simulation involves creating rules that translate sentiment scores into actionable trading signals. For example:

**Positive Sentiment**: If the sentiment analysis outputs a strong positive score, the strategy might consider this an indicator to buy or hold a position, anticipating upward price movement.

**Negative Sentiment**: Conversely, a strong negative sentiment might trigger a sell signal or short-selling, anticipating downward movement.

**Neutral Sentiment**: In cases of neutral sentiment, the strategy might opt to hold its current positions or seek other indicators for decision-making.

**Threshold Setting**: Critical to the strategy are the thresholds set for interpreting the sentiment scores. These thresholds determine how strongly positive or negative sentiment must be to trigger buying or selling actions.
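
The rules above can be sketched as a simple threshold function. It assumes the sentiment model's output has been collapsed into one score in [-1, 1] (for example, the positive probability minus the negative probability); both that convention and the threshold values are assumptions for illustration.

```python
def sentiment_to_signal(score, buy_threshold=0.3, sell_threshold=-0.3):
    """Map a sentiment score in [-1, 1] to a trading signal."""
    if score >= buy_threshold:
        return "BUY"
    if score <= sell_threshold:
        return "SELL"
    return "HOLD"  # neutral zone between the two thresholds

for score in (0.6, 0.1, -0.45):
    print(score, sentiment_to_signal(score))
```

Widening the gap between the two thresholds makes the strategy trade less often and only on stronger sentiment.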

**Combining Multiple Indicators**: Beyond sentiment, the strategy can integrate other technical indicators (e.g., moving averages, volatility indices) to confirm signals or mitigate risks, thus creating a more robust trading system.
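
As one possible way to combine indicators, a sentiment signal can be acted on only when the latest price agrees with a simple moving average. The confirmation rule and the window length below are assumptions chosen for illustration, not the project's actual logic.

```python
def simple_moving_average(prices, window):
    """Mean of the last `window` prices, or None if there is not enough history."""
    if len(prices) < window:
        return None
    return sum(prices[-window:]) / window

def confirmed_signal(signal, prices, window=3):
    """Keep a BUY only above the moving average and a SELL only below it."""
    sma = simple_moving_average(prices, window)
    if sma is None:
        return "HOLD"
    last = prices[-1]
    if signal == "BUY" and last > sma:
        return "BUY"
    if signal == "SELL" and last < sma:
        return "SELL"
    return "HOLD"

print(confirmed_signal("BUY", [4946.25, 4953.50, 4954.25]))
```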

## Simulation Environment and Backtesting
**Simulation Tools and Platforms**: The strategies are tested using a trading simulation platform that mimics the behavior of financial markets. This platform allows the integration of the trading strategy logic coded in Python, leveraging historical market data to simulate trades.

**Data Handling**: Historical market data, including price, volume, and perhaps other market indicators, are used alongside the sentiment data to test the strategy. This data must be aligned in time with the sentiment analysis outputs to ensure that the simulation reflects realistic trading scenarios.
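
A simple way to align the two series without look-ahead bias is an "as-of" lookup: each price bar receives the most recent sentiment score published at or before the bar's timestamp. The sketch below assumes the sentiment timestamps are sorted in ascending order; the helper name and sample values are hypothetical.

```python
from bisect import bisect_right
from datetime import datetime

def latest_sentiment_before(bar_time, sentiment_times, sentiment_scores):
    """Return the most recent score published at or before bar_time, else None."""
    idx = bisect_right(sentiment_times, bar_time)
    return sentiment_scores[idx - 1] if idx else None

times = [datetime(2024, 3, 12, 8, 0), datetime(2024, 3, 12, 9, 30)]
scores = [0.2, -0.5]
print(latest_sentiment_before(datetime(2024, 3, 12, 9, 0), times, scores))
```

Both series should use the same time zone before comparison. The bars returned by TWS in the earlier example carry a UTC offset, so naive and timezone-aware timestamps must not be mixed.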

**Backtesting Framework**: The backtesting involves running the trading strategy against historical data to evaluate its effectiveness. Key metrics such as return on investment, maximum drawdown, and the Sharpe ratio are calculated to assess performance.
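
The metrics named above can be computed from an equity curve and a series of per-period returns. This sketch assumes daily returns (252 periods per year) and a zero risk-free rate by default; both are conventions to adjust, and the function names are illustrative.

```python
import math
from statistics import mean, stdev

def total_return(equity):
    """Overall return from the first to the last equity value."""
    return equity[-1] / equity[0] - 1.0

def max_drawdown(equity):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst

def sharpe_ratio(returns, periods_per_year=252, risk_free_per_period=0.0):
    """Annualized mean excess return divided by its standard deviation."""
    excess = [r - risk_free_per_period for r in returns]
    spread = stdev(excess)
    if spread == 0:
        return 0.0  # undefined for constant returns; reported as 0 here
    return mean(excess) / spread * math.sqrt(periods_per_year)

equity = [100.0, 110.0, 99.0, 120.0]
print(total_return(equity), max_drawdown(equity))
```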

## Risk Management and Optimization
**Risk Considerations**: The simulation incorporates risk management protocols, such as setting stop-loss orders, diversifying holdings, and limiting exposure per trade. These measures are designed to protect against large losses in adverse market conditions.
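
Limiting exposure per trade is often done with fixed-fractional position sizing: the distance to the stop-loss determines how many contracts fit inside a risk budget. The sketch below uses a point value of 50, matching the multiplier reported for the ES contract earlier; the 1% risk fraction and the prices are illustrative assumptions.

```python
def position_size(equity, risk_fraction, entry_price, stop_price, point_value=50.0):
    """Number of whole contracts whose stop-loss risk fits within the risk budget."""
    risk_per_contract = abs(entry_price - stop_price) * point_value
    if risk_per_contract == 0:
        return 0  # no stop distance means the risk cannot be bounded
    return int(equity * risk_fraction // risk_per_contract)

# Risk 1% of a 100,000 account with a stop 8 points below entry
print(position_size(100_000, 0.01, entry_price=5000.0, stop_price=4992.0))
```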

**Optimization Techniques**: Parameters within the strategy, such as sentiment thresholds and the parameters of other technical indicators, are optimized based on historical performance. Techniques such as grid search, random search, or even machine learning methods like reinforcement learning can be employed to find the optimal settings.
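
Grid search over the sentiment thresholds can be sketched as follows. The `evaluate` callable stands in for a full backtest that returns a score such as the Sharpe ratio; it is a hypothetical interface, and the toy scoring function exists only to make the example runnable.

```python
from itertools import product

def grid_search(buy_grid, sell_grid, evaluate):
    """Try every (buy, sell) threshold pair and return the best one with its score."""
    best_params, best_score = None, float("-inf")
    for buy, sell in product(buy_grid, sell_grid):
        if sell >= buy:
            continue  # the sell threshold must sit below the buy threshold
        score = evaluate(buy, sell)
        if score > best_score:
            best_params, best_score = (buy, sell), score
    return best_params, best_score

# Toy objective that peaks at buy=0.3, sell=-0.2 (a stand-in for a backtest)
toy_score = lambda buy, sell: -(buy - 0.3) ** 2 - (sell + 0.2) ** 2
print(grid_search([0.1, 0.3, 0.5], [-0.4, -0.2], toy_score))
```

Because the thresholds are tuned on historical data, the chosen pair should be checked on a held-out period to limit overfitting.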

## Evaluation and Iterative Improvement
**Performance Evaluation**: After backtesting, the strategy's performance is analyzed in depth. This includes studying the strategy's ability to capitalize on market opportunities and its resilience during market downturns.

**Iterative Refinement**: Based on the outcomes of the initial simulations, the strategy may undergo iterative refinements to improve its performance or adapt to new market insights. This iterative process is crucial as it allows the strategy to evolve in response to feedback from its performance in simulated environments.

# 5. Integration with TWS for Simulated Trading #
The tws.py script integrates the trading strategies with the Interactive Brokers Trader Workstation (TWS) for simulated trading.

## Methodology
API Connection: Establishes a connection to TWS using the ib_insync library, which allows for programmatic control over trading actions.

Order Execution: Based on the sentiment analysis, trades are executed in a simulated environment to test the integration and operational efficiency of the system.
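
Before an order is sent, the trading signal has to be turned into an action and a quantity that account for the current position. The helper below is a hypothetical sketch of that step and is not taken from tws.py. With ib_insync, the resulting pair could then be passed to `MarketOrder(action, quantity)` and submitted with `ib.placeOrder(contract, order)` against a paper-trading session.

```python
def order_for_signal(signal, current_position, quantity=1):
    """Return (action, quantity) needed to reach the target position, or None."""
    targets = {"BUY": quantity, "SELL": -quantity}
    if signal not in targets:
        return None  # HOLD or unknown signal: leave the position alone
    delta = targets[signal] - current_position
    if delta == 0:
        return None  # already at the target position
    return ("BUY" if delta > 0 else "SELL", abs(delta))

print(order_for_signal("SELL", current_position=1))
```

Treating signals as target positions rather than raw orders prevents repeated BUY signals from stacking up an ever larger position.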

## Design Choices and Assumptions
Use of Simulated Environment: The use of a simulated trading environment is crucial for safely testing the system without financial risk.


# 6. Documentation #