**STEP 1 - Global Imports and Installs**

In [25]:
# Install required packages (run in a Jupyter cell if not already installed)
!pip install gymnasium pandas numpy matplotlib scikit-learn requests

# Import required libraries
import gymnasium as gym
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import requests
import random
import json
import warnings

warnings.filterwarnings("ignore")




**STEP 2 - Simulating Daily Financial News for Testing**

In this step, we create a simple DataFrame named `news_data` that simulates five consecutive days of financial news headlines related to a single stock — in this case, Apple (AAPL). Each entry includes:

- A `date` field ranging from January 1, 2023, to January 5, 2023
- A `headline` field representing a simplified news article or market update for the day

This simulated dataset serves as the input for our generative agent, which will later interpret each headline using an LLM and make trading decisions based on the inferred sentiment.

This controlled input allows us to test the agent’s reasoning and decision-making without relying on external data sources in the early development phase.


In [26]:
# Simulated daily financial news for a single stock (e.g., AAPL)
news_data = pd.DataFrame({
    "date": pd.date_range(start="2023-01-01", periods=5, freq="D"),
    "headline": [
        "Apple announces record-breaking Q1 earnings",
        "Concerns grow over global chip shortages affecting tech stocks",
        "Apple stock downgraded by major investment bank",
        "New iPhone launch receives strong pre-orders",
        "Federal Reserve hints at possible interest rate hikes"
    ]
})

news_data


| | date | headline |
|---|---|---|
| 0 | 2023-01-01 | Apple announces record-breaking Q1 earnings |
| 1 | 2023-01-02 | Concerns grow over global chip shortages affec... |
| 2 | 2023-01-03 | Apple stock downgraded by major investment bank |
| 3 | 2023-01-04 | New iPhone launch receives strong pre-orders |
| 4 | 2023-01-05 | Federal Reserve hints at possible interest rat... |


**Defining a Function to Query Ollama LLM for Headline Interpretation**

The `query_ollama()` function sends a financial news headline to a locally running Ollama model (such as Mistral or LLaMA) and returns the model's assessment of the likely short-term impact on Apple stock.

**Key elements of the function:**

- **Input Parameters:**
  - `headline`: A string containing the financial news headline.
  - `model`: Name of the local LLM to query via Ollama (default is `"mistral"`).

- **Prompt Construction:**
  - The prompt explicitly instructs the model to evaluate the headline and return one of three categorical outputs: `'UP'`, `'DOWN'`, or `'NEUTRAL'`.

- **HTTP POST Request:**
  - A call is made to `http://localhost:11434/api/generate` with the prompt and model name using the Ollama API.

- **Output Handling:**
  - The model's response is extracted, stripped of surrounding whitespace, and converted to uppercase.
  - If the response is not exactly one of the expected categories, it defaults to `'NEUTRAL'`.

- **Error Handling:**
  - Any exception (e.g., connection error or JSON decoding failure) is caught and printed, and the function returns `'NEUTRAL'` as a fallback.

This function bridges the LLM’s natural language understanding with the reinforcement learning environment, providing sentiment-derived insights for decision-making.


In [27]:
def query_ollama(headline, model="mistral"):
    """Ask a local Ollama model for 'UP', 'DOWN', or 'NEUTRAL' given a headline."""
    prompt = (
        "Given the following financial news headline about Apple stock:\n\n"
        f"'{headline}'\n\n"
        "Please analyze the likely short-term impact on Apple stock price. "
        "Respond with one of the following directions ONLY: 'UP', 'DOWN', or 'NEUTRAL'."
    )

    try:
        response = requests.post(
            url="http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=60,  # avoid hanging forever if the local server stalls
        )
        response.raise_for_status()  # treat HTTP errors as failures
        result = response.json()
        direction = result["response"].strip().upper()
        # Anything other than an exact label falls back to NEUTRAL
        if direction not in {"UP", "DOWN", "NEUTRAL"}:
            direction = "NEUTRAL"
        return direction
    except Exception as e:
        print(f"Ollama call failed: {e}")
        return "NEUTRAL"
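
The exact-match check in `query_ollama()` maps any reply that carries quotes, a trailing period, or an explanation to `'NEUTRAL'`. As a minimal sketch of a more tolerant alternative (the names `VALID_LABELS` and `parse_direction` are hypothetical and not part of the notebook), the reply can be scanned for a single label word:

```python
import re

# Hypothetical helper, not part of the original notebook.
VALID_LABELS = ("UP", "DOWN", "NEUTRAL")
LABEL_PATTERN = re.compile(r"\b(" + "|".join(VALID_LABELS) + r")\b")

def parse_direction(raw_reply: str) -> str:
    """Return the one label found in the reply, or 'NEUTRAL' if none or several appear."""
    # Upper-case first so matching is case-insensitive; \b ignores quotes and punctuation.
    found = set(LABEL_PATTERN.findall(raw_reply.upper()))
    if len(found) == 1:
        return found.pop()
    # Ambiguous or unrecognized replies keep the notebook's NEUTRAL fallback.
    return "NEUTRAL"

print(parse_direction("'UP'"))
print(parse_direction("Down."))
print(parse_direction("It could go up or down"))
```

Whether this is preferable depends on the model: a chatty reply that mentions a label word in passing would be accepted here, while the strict check would reject it.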


**Querying Ollama to Interpret Each Headline**

In this step, we apply the `query_ollama()` function to each row in the `headline` column of our `news_data` DataFrame.

- The function sends each headline to the locally running Ollama model.
- The LLM analyzes the likely short-term market impact and returns one of three directional labels: `'UP'`, `'DOWN'`, or `'NEUTRAL'`.
- The result is stored in a new column called `direction`.

This labeled dataset now contains both the original news and the model’s inferred sentiment, which will be used as input features for the reinforcement learning environment.


In [28]:
# Query Ollama for each headline
news_data["direction"] = news_data["headline"].apply(query_ollama)

news_data


| | date | headline | direction |
|---|---|---|---|
| 0 | 2023-01-01 | Apple announces record-breaking Q1 earnings | UP |
| 1 | 2023-01-02 | Concerns grow over global chip shortages affec... | NEUTRAL |
| 2 | 2023-01-03 | Apple stock downgraded by major investment bank | NEUTRAL |
| 3 | 2023-01-04 | New iPhone launch receives strong pre-orders | NEUTRAL |
| 4 | 2023-01-05 | Federal Reserve hints at possible interest rat... | NEUTRAL |
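
Because unparseable replies and failed calls both fall back to `'NEUTRAL'`, it can be worth checking the label distribution before moving on. The sketch below tallies the labels copied from the output above. In the notebook itself, `news_data["direction"].value_counts()` would give the same tally.

```python
from collections import Counter

# Labels copied from the `direction` column shown above.
directions = ["UP", "NEUTRAL", "NEUTRAL", "NEUTRAL", "NEUTRAL"]

label_counts = Counter(directions)
neutral_share = label_counts["NEUTRAL"] / len(directions)

print(label_counts)
print(f"NEUTRAL share: {neutral_share:.0%}")
```

A high share of `'NEUTRAL'` does not by itself prove a parsing problem, but it is a cue to inspect the raw model replies.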


**Simulating Stock Price Changes Based on LLM-Inferred Direction**

To evaluate the agent's decisions, we simulate daily percentage changes in Apple’s stock price based on the sentiment labels (`'UP'`, `'DOWN'`, `'NEUTRAL'`) generated by the LLM.

- A fixed random seed (`np.random.seed(42)`) ensures reproducibility.
- For each directional label:
  - `'UP'` leads to a small positive return (0.5% to 2.0%)
  - `'DOWN'` leads to a small negative return (-2.0% to -0.5%)
  - `'NEUTRAL'` results in a flat or mixed return (-0.5% to 0.5%)

The simulated percentage change is added as a new column `price_change_pct` in the `news_data` DataFrame. This column will be used to compute rewards within the trading environment.


In [29]:
# Simulate stock returns (in %): positive, negative, or flat
np.random.seed(42)
price_change = []

for direction in news_data["direction"]:
    if direction == "UP":
        change = np.random.uniform(0.5, 2.0)   # small gain
    elif direction == "DOWN":
        change = np.random.uniform(-2.0, -0.5) # small loss
    else:
        change = np.random.uniform(-0.5, 0.5)  # neutral
    price_change.append(round(change, 2))

news_data["price_change_pct"] = price_change

news_data


| | date | headline | direction | price_change_pct |
|---|---|---|---|---|
| 0 | 2023-01-01 | Apple announces record-breaking Q1 earnings | UP | 1.06 |
| 1 | 2023-01-02 | Concerns grow over global chip shortages affec... | NEUTRAL | 0.45 |
| 2 | 2023-01-03 | Apple stock downgraded by major investment bank | NEUTRAL | 0.23 |
| 3 | 2023-01-04 | New iPhone launch receives strong pre-orders | NEUTRAL | 0.1 |
| 4 | 2023-01-05 | Federal Reserve hints at possible interest rat... | NEUTRAL | -0.34 |
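
The label-to-range mapping can also be kept in one lookup table, which makes the ranges easier to adjust. This is a sketch only: `RETURN_RANGES` and `simulate_change` are hypothetical names, and it uses Python's `random.Random` rather than `np.random`, so its numbers will not match the table above.

```python
import random

# Hypothetical lookup table: direction label -> (low, high) daily return in percent.
RETURN_RANGES = {
    "UP": (0.5, 2.0),
    "DOWN": (-2.0, -0.5),
    "NEUTRAL": (-0.5, 0.5),
}

def simulate_change(direction, rng):
    """Draw a rounded percentage change for a label; unknown labels are treated as NEUTRAL."""
    low, high = RETURN_RANGES.get(direction, RETURN_RANGES["NEUTRAL"])
    return round(rng.uniform(low, high), 2)

rng = random.Random(42)  # local generator, so the global random state is untouched
directions = ["UP", "DOWN", "NEUTRAL"]
changes = [simulate_change(d, rng) for d in directions]
print(changes)
```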


**Creating a Custom Trading Environment with Gymnasium**

The `NewsTradingEnv` class defines a custom Gymnasium environment (Gymnasium is the maintained successor to OpenAI Gym) for training a reinforcement learning agent to make stock trading decisions based on news sentiment.

**Key components:**

- **Initialization (`__init__`)**
  - Takes in a `news_df` DataFrame containing news headlines and simulated price changes.
  - Defines:
    - `action_space` with 3 discrete choices:  
      - `0`: SELL  
      - `1`: HOLD  
      - `2`: BUY
    - `observation_space` as a 3-element one-hot encoded vector indicating sentiment (`UP`, `DOWN`, `NEUTRAL`).

- **Reset (`reset`)**
  - Called at the start of every new episode.
  - Resets `current_step` to 0 and returns the first observation.

- **Observation Getter (`_get_obs`)**
  - Converts the LLM-derived sentiment (`direction`) at the current step into a one-hot vector:
    - `[1, 0, 0]` for UP
    - `[0, 1, 0]` for DOWN
    - `[0, 0, 1]` for NEUTRAL

- **Step Function (`step`)**
  - Receives an `action` from the agent.
  - Computes the `reward` from the action and the simulated price change for the current step:
    - BUY before a gain or SELL before a loss earns a positive reward equal to the size of the move.
    - HOLD incurs a small penalty equal to 10% of the absolute price change.
    - Any other combination (BUY before a loss, SELL before a gain) earns zero reward.
  - Advances to the next step and returns the new observation, the reward, an end-of-episode flag, a truncation flag (always `False`), and an empty info dictionary.

- **Render (`render`)**
  - Simply prints the current time step for basic debugging.

This environment allows the agent to learn trading behavior from simplified financial scenarios using reinforcement learning techniques.
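
The observation encoding and reward rule described above can be written as two standalone functions, which makes them easy to check by hand. This is a sketch that mirrors the description. The names `one_hot_direction`, `compute_reward`, and the action constants are hypothetical and do not appear in the notebook.

```python
# Hypothetical constants matching the action space described above.
SELL, HOLD, BUY = 0, 1, 2

def one_hot_direction(direction):
    """Encode a sentiment label as [UP, DOWN, NEUTRAL]; unknown labels count as NEUTRAL."""
    if direction == "UP":
        return [1, 0, 0]
    if direction == "DOWN":
        return [0, 1, 0]
    return [0, 0, 1]

def compute_reward(action, price_change_pct):
    """Reward for one step, following the rule described in the text."""
    if action == BUY and price_change_pct > 0:
        return price_change_pct           # bought before a gain
    if action == SELL and price_change_pct < 0:
        return -price_change_pct          # sold before a loss
    if action == HOLD:
        return -abs(price_change_pct) * 0.1  # small penalty for sitting out
    return 0.0                            # wrong directional call earns nothing

print(one_hot_direction("DOWN"))
print(compute_reward(BUY, 1.06), compute_reward(SELL, -0.34), compute_reward(BUY, -0.34))
```

Note that a wrong directional call is not penalized under this rule, so BUY is never worse than HOLD.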


In [30]:
import gymnasium as gym
from gymnasium import spaces

class NewsTradingEnv(gym.Env):
    def __init__(self, news_df):
        super().__init__()
        self.news_df = news_df
        self.max_steps = len(news_df)
        self.current_step = 0

        # Action space: 0 = SELL, 1 = HOLD, 2 = BUY
        self.action_space = spaces.Discrete(3)

        # Observation space: one-hot direction vector (UP, DOWN, NEUTRAL)
        self.observation_space = spaces.MultiBinary(3)

    def reset(self, seed=None, options=None):
        self.current_step = 0
        return self._get_obs(), {}

    def _get_obs(self):
        direction = self.news_df.iloc[self.current_step]["direction"]
        if direction == "UP":
            return np.array([1, 0, 0])
        elif direction == "DOWN":
            return np.array([0, 1, 0])
        else:
            return np.array([0, 0, 1])

    def step(self, action):
        row = self.news_df.iloc[self.current_step]
        actual_change = row["price_change_pct"]

        # Reward logic
        reward = 0
        if action == 2 and actual_change > 0:  # BUY and price goes up
            reward = actual_change
        elif action == 0 and actual_change < 0:  # SELL and price drops
            reward = -actual_change  # positive reward
        elif action == 1:  # HOLD
            reward = -abs(actual_change) * 0.1  # small penalty: 10% of the move size

        self.current_step += 1
        done = self.current_step >= self.max_steps
        obs = self._get_obs() if not done else np.array([0, 0, 0])

        return obs, reward, done, False, {}

    def render(self):
        print(f"Step {self.current_step}")


In [31]:
import gymnasium as gym
from gymnasium import spaces
import numpy as np

class NewsTradingEnv(gym.Env):
    def __init__(self, news_df):
        super().__init__()
        self.news_df = news_df
        self.max_steps = len(news_df)
        self.current_step = 0

        # Action space: 0 = SELL, 1 = HOLD, 2 = BUY
        self.action_space = spaces.Discrete(3)

        # Observation space: one-hot vector for [UP, DOWN, NEUTRAL], dtype must be int8
        self.observation_space = spaces.MultiBinary(3)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)  # let Gymnasium seed self.np_random
        self.current_step = 0
        # _get_obs() already returns an int8 array
        return self._get_obs(), {}

    def _get_obs(self):
        direction = self.news_df.iloc[self.current_step]["direction"]
        if direction == "UP":
            obs = np.array([1, 0, 0], dtype=np.int8)
        elif direction == "DOWN":
            obs = np.array([0, 1, 0], dtype=np.int8)
        else:
            obs = np.array([0, 0, 1], dtype=np.int8)
        return obs

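    def step(self, action):
        row = self.news_df.iloc[self.current_step]
        actual_change = row["price_change_pct"]

        # Correct directional calls earn the size of the move, HOLD pays a
        # small penalty, and wrong calls earn nothing.
        reward = 0.0
        if action == 2 and actual_change > 0:    # BUY before a gain
            reward = actual_change
        elif action == 0 and actual_change < 0:  # SELL before a loss
            reward = -actual_change
        elif action == 1:                        # HOLD
            reward = -abs(actual_change) * 0.1

        self.current_step += 1
        terminated = self.current_step >= self.max_steps
        # After the last row there is no news left, so return an all-zero observation.
        obs = self._get_obs() if not terminated else np.zeros(3, dtype=np.int8)

        # Gymnasium step API: (obs, reward, terminated, truncated, info)
        return obs, float(reward), terminated, False, {}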

    def render(self):
        print(f"Step {self.current_step}")


In [32]:
# check_env comes from stable-baselines3; run the install cell below first if this import fails
from stable_baselines3.common.env_checker import check_env

env = NewsTradingEnv(news_data)
check_env(env, warn=True)


In [33]:
# Run this once if stable-baselines3 is not installed
# (the quotes keep shells such as zsh from expanding the square brackets)
!pip install "stable-baselines3[extra]"




In [34]:
from stable_baselines3 import PPO
from stable_baselines3.common.env_checker import check_env
from stable_baselines3.common.vec_env import DummyVecEnv

# Wrap our custom environment for stable-baselines3 compatibility
env = NewsTradingEnv(news_data)
check_env(env, warn=True)  # Optional but good for catching bugs
vec_env = DummyVecEnv([lambda: NewsTradingEnv(news_data)])


In [35]:
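# Define the PPO agent
model = PPO("MlpPolicy", vec_env, verbose=1)

# Train the agent. PPO collects n_steps (default 2048) transitions per rollout,
# so the log below reports 2048 timesteps even though 1000 were requested.
model.learn(total_timesteps=1000)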


Using cpu device
-----------------------------
| time/              |      |
|    fps             | 2340 |
|    iterations      | 1    |
|    time_elapsed    | 0    |
|    total_timesteps | 2048 |
-----------------------------


<stable_baselines3.ppo.ppo.PPO at 0x1e64f1d4450>
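
The log above reports `total_timesteps` of 2048 although `learn()` was called with `total_timesteps=1000`. A rough model of why, assuming PPO's default `n_steps=2048`, a single environment, and training that stops only at rollout boundaries (the helper name `timesteps_actually_run` is hypothetical):

```python
import math

def timesteps_actually_run(total_timesteps, n_steps=2048, n_envs=1):
    """Round the requested timesteps up to a whole number of rollouts."""
    rollout_size = n_steps * n_envs  # transitions collected before each update
    n_rollouts = math.ceil(total_timesteps / rollout_size)
    return n_rollouts * rollout_size

print(timesteps_actually_run(1000))  # 2048
print(timesteps_actually_run(5000))  # 6144
```

With a five-step episode, one rollout of 2048 transitions covers roughly 400 episodes, and the policy is updated only after that rollout completes.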

In [36]:
# Evaluate trained agent
env = NewsTradingEnv(news_data)
obs, _ = env.reset()
done = False
total_reward = 0

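while not done:
    action, _ = model.predict(obs, deterministic=True)
    # predict() returns a NumPy value; convert it to a plain int for the env
    obs, reward, terminated, truncated, _ = env.step(int(action))
    done = terminated or truncated
    total_reward += reward
    env.render()
    print(f"Action: {action}, Reward: {reward:.2f}")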

print(f"Total reward by PPO agent: {total_reward:.2f}")


Step 1
Action: 2, Reward: 1.06
Step 2
Action: 2, Reward: 0.45
Step 3
Action: 2, Reward: 0.23
Step 4
Action: 2, Reward: 0.10
Step 5
Action: 2, Reward: 0.00
Total reward by PPO agent: 1.84
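
To put the total in context, it helps to compare it with fixed-action baselines on the same five price changes. The sketch below re-implements the environment's reward rule as a plain function (`step_reward` and the policy names are hypothetical) and uses the `price_change_pct` values from the table earlier in the notebook. The "oracle" looks at the actual price change, which the agent never observes, so it is an upper bound and not a usable strategy.

```python
SELL, HOLD, BUY = 0, 1, 2

# price_change_pct values from the simulated table above.
price_changes = [1.06, 0.45, 0.23, 0.10, -0.34]

def step_reward(action, change):
    """Same reward rule as NewsTradingEnv.step, written as a plain function."""
    if action == BUY and change > 0:
        return change
    if action == SELL and change < 0:
        return -change
    if action == HOLD:
        return -abs(change) * 0.1
    return 0.0

def total_reward(policy):
    """Sum of rewards when `policy` maps each price change to an action."""
    return sum(step_reward(policy(change), change) for change in price_changes)

baselines = {
    "always BUY": lambda change: BUY,
    "always HOLD": lambda change: HOLD,
    "always SELL": lambda change: SELL,
    "oracle": lambda change: BUY if change > 0 else SELL,
}

for name, policy in baselines.items():
    print(f"{name}: {total_reward(policy):.2f}")
```

The always-BUY baseline totals 1.84, the same as the PPO agent's total above, while the oracle reaches 2.18 by selling on the final day.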
