# Repo Link
https://github.com/mr-kelsey/fa25-aai520-group6

# Objective
Build an autonomous Investment Research Agent that:
1. Plans its research steps for a given stock symbol.
2. Uses tools dynamically (APIs, datasets, retrieval).
3. Self-reflects to assess the quality of its output.
4. Learns across runs (e.g., keeps brief memories or notes to improve future analyses).

# Code

In [None]:
"""Import required libraies and load environmentals"""
from dotenv import dotenv_values
from huggingface_hub import login
from smolagents import ToolCallingAgent, InferenceClientModel
from warnings import filterwarnings

from tools import chaining, performance, risk, sentiment, impact, commander, evaluator

# GluonTS uses an outdated pandas DataFrame access method that triggers a warning. We silence warnings here to provide cleaner output.
filterwarnings("ignore")
env = dotenv_values(".env")

## Prompt Chaining:

### Ingest News

In [2]:
links = chaining.retrieve_article_links("BITB")
links

['https://finance.yahoo.com/news/uptober-momentum-shift-focus-back-151200736.html',
 'https://finance.yahoo.com/m/467e380b-e9a8-32c5-8e24-b43f00fa2f67/morgan-stanley-plans-to.html',
 'https://finance.yahoo.com/m/fc13754d-4f82-3832-a683-7d5d5c10e256/new-bitcoin-whales-are.html',
 'https://finance.yahoo.com/m/37c06cd2-3def-31c7-abb6-cb591ede9580/bitcoin-etfs-pull-in-%24676m-as.html',
 'https://finance.yahoo.com/m/aeb9b0c3-7b75-3a6e-be2e-8c30f937870c/will-bitcoin-experience-a.html',
 'https://finance.yahoo.com/m/dab01a2c-98c7-3b68-8e45-bf5fe85e5c20/blackrock%E2%80%99s-ibit-is-nearing.html',
 'https://finance.yahoo.com/m/50dfd9d6-ce84-3546-afdf-181edf2ec8d4/bitcoin-etfs-haul-in-%241.19.html',
 'https://finance.yahoo.com/m/cb9cfb50-d9b0-3f56-8190-54e292ab4a37/bitcoin-etfs-smash-%241.19b.html']

### Preprocess

In [3]:
articles = chaining.preprocess(links)
articles[0]

['September proved volatile for Bitcoin, though the digital asset regained momentum late in the month. From Sept. 24 to Oct. 6, it rose 14%.',
 'Increasing interest from institutional investors is sending a positive signal to the market, reflecting the confidence of the world’s largest institutions in digital currency. Additionally, pro-crypto moves by the Trump administration and the digital currency’s growing ties to the broader financial ecosystem are another key tailwind for the asset.',
 'Bitcoin’s volatility has been a constant theme this year, led by trade war uncertainties. However, the fundamental drivers of digital currencies are expected to remain robust and support the anticipated stability ahead.',
 'Further Fed interest rate cuts could boost investor risk appetite, potentially leading to increased exposure to digital currencies. Additionally, lower interest rates would leave investors with more capital, often leading to increased interest in cryptocurrency.',
 'According 

### Classify

In [4]:
sentiment_scores = chaining.classify(articles)
sentiment_scores

Device set to use cpu


[1, 0, -1, 1, 0, 0, 1, 1]

### Extract

In [5]:
statistics = chaining.extract(sentiment_scores)
statistics

{'qty': 8, 'pos': 4, 'tral': 3, 'neg': 1}

### Summarize

In [6]:
summary = chaining.summarize(statistics)
summary

'\nOf the 8 articles reviwed, 0.5% were positive, 0.125% were negative, and 0.375% were neutral.\nThe overall sentiment score as calculated by reviewing these articles is 0.375. This indicates that the overall sentiment\nabout this financial instrument is slightly positive.\n'
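The source of `chaining.extract` and `chaining.summarize` is not shown in this notebook. The sketch below is one way to reproduce the numbers above, assuming the overall score is (positive - negative) / total, which matches the reported 0.375. The names `extract_counts` and `summarize_counts` are hypothetical, and unlike the output above the shares are formatted as true percentages.

```python
from collections import Counter


def extract_counts(scores):
    """Count positive (1), neutral (0), and negative (-1) classifications."""
    counts = Counter(scores)
    return {"qty": len(scores), "pos": counts[1], "tral": counts[0], "neg": counts[-1]}


def summarize_counts(stats):
    """Turn the counts into shares and a net sentiment score in [-1, 1]."""
    qty = stats["qty"]
    if qty == 0:
        raise ValueError("no articles to summarize")
    return {
        "positive": stats["pos"] / qty,
        "neutral": stats["tral"] / qty,
        "negative": stats["neg"] / qty,
        # Assumed scoring rule: net share of positive over negative articles.
        "score": (stats["pos"] - stats["neg"]) / qty,
    }


stats = extract_counts([1, 0, -1, 1, 0, 0, 1, 1])
summary = summarize_counts(stats)
print(stats)
print(
    f"{summary['positive']:.1%} positive, {summary['negative']:.1%} negative, "
    f"{summary['neutral']:.1%} neutral, score {summary['score']}"
)
```

Formatting with `:.1%` multiplies the fraction by 100, so 4 of 8 articles prints as 50.0% rather than 0.5%.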

### Create 'Agent'

In [None]:
"""Putting it all together to simulate prompt chaining steps that would be taken by a single agent"""
def prompt_chaining_agent(financial_instrument):
    # Ingest News
    links = chaining.retrieve_article_links(financial_instrument)
    # Preprocess
    articles = chaining.preprocess(links)
    # Classify
    sentiment_scores = chaining.classify(articles)
    # Extract
    statistics = chaining.extract(sentiment_scores)
    # Summarize
    return chaining.summarize(statistics)

prompt_chaining_agent("NVDA")

Device set to use cpu


'\nOf the 8 articles reviwed, 0.25% were positive, 0.0% were negative, and 0.75% were neutral.\nThe overall sentiment score as calculated by reviewing these articles is 0.25. This indicates that the overall sentiment\nabout this financial instrument is slightly positive.\n'

## Routing:

In [None]:
"""Log in to hugging face to use headless inference routing"""
login(token=env["AUTH"])

In [4]:
"""Configure agents"""
base_model = InferenceClientModel(model_id="Qwen/Qwen3-Coder-30B-A3B-Instruct")

performance_agent = ToolCallingAgent(
    name="Perforance",
    description="Analyze stock performance, historical growth, and future potential.",
    model=base_model,
    tools=performance.tools,
    )

risk_agent = ToolCallingAgent(
    name="Risk",
    description="Measure risk, volatility, and downside probability.",
    model=base_model,
    tools=risk.tools,
    )

sentiment_agent = ToolCallingAgent(
    name="Sentiment",
    description="Gauge market mood through news and media sentiment.",
    model=base_model,
    tools=sentiment.tools,
    )

impact_agent = ToolCallingAgent(
    name="Impact",
    description="Evaluate sustainability, innovation, and ethical governance.",
    model=base_model,
    tools=impact.tools,
    )

evaluation_agent = ToolCallingAgent(
    name="Evaluator",
    description="Audit, validate, and optimize Commander's final decision.",
    model=base_model,
    tools=evaluator.tools,
    )

commander_agent = ToolCallingAgent(
    name="Commander",
    description="Synthesize all agent outputs and deliver final classification.",
    model=base_model,
    tools=commander.tools,
    managed_agents=[performance_agent, risk_agent, sentiment_agent, impact_agent, evaluation_agent],
    )

commander_agent.visualize()

In [None]:
task1 = "NVIDIA stock (NVDA)"
additional_args = {"news_api_key": env["NEWS_API"], "alpha_api_key": env["ALPHA_API"]}

commander_agent.run(f"""Your task is to classify each of the following financial instruments into “BUY”, “HOLD”, or “AVOID”:
1. {task1}

You have a very capable team below you that you should leverage to accomplish your task.  Here are the profiles of each of your team members:
`Performance`
	Role: Calculates a performance score based on the model prediction of future financial instrument movement.
    Requires: alpha_api_key
	Output: A performance score between 0 and 1 with 1 being the most performant.
    
`Risk`
	Role: Calculates a risk score based on downside deviation.
	Output: A risk score between -1 and 1 where 1 means “High Risk”, -1 means “High Reward”, and 0 represents a “Stable Performer”.
    
`Sentiment`
	Role: Calculates a sentiment score from news articles related to the financial instrument in question.
    Requires: news_api_key
	Output: A sentiment score consisting of the mean value of the sentiment classifications for all articles analyzed.
    
`Impact`
	Role: Calculates an impact score based on simulation.
	Output: An impact score between 0 and 1 where the higher the score, the greater the impact.

`Evaluator`
	Role: Reviews your logic and determines if the logic is sound.
    Output: A binary classification: "PASS" if the logic is sound, "FAIL" if the logic is unsound.
    
    
An example workflow would be:
1) Task each of your workers with doing their job:
	i) Tell Performance to calculate a performance score using its tools.
    ii) Tell Risk to calculate a risk score using its tools.
    iii) Tell Sentiment to calculate a sentiment score using its tools.
    iv) Tell Impact to calculate an impact score using its tools.
2) Use your team members' scores to determine a recommendation.
3) Send the scores from your team along with your recommendation to Evaluator for final sign-off.
4) Decide what to do next based on what Evaluator returns:
	i) If Evaluator returns "PASS":
		5) Output your final recommendation: “BUY”, “HOLD”, or “AVOID”
    ii) If Evaluator returns "FAIL":
		5) Use Evaluator's `Additional context` to see where things went wrong.
        6) Output your initial recommendation along with any suggestions for making it better as indicated by Evaluator.
""", additional_args=additional_args)


╭────────────────────────────────────────────── New run - Commander ──────────────────────────────────────────────╮
│                                                                                                                 │
│ Your task is to classify each of the following financial instruments into “BUY”, “HOLD”, or “AVOID”:            │
│ 1. NVIDIA stock (NVDA)                                                                                          │
│                                                                                                                 │
│ You have a very capable team below you that you should leverage to accomplish your task.  Here are the profiles │
│ of each of your team members:                                                                                   │
│ `Performance`                                              

## Evaluator–Optimizer:

If we had more time, we would wire this up properly so that the commander could change the hyperparameters of each agent. We would also make the evaluator's output much more robust, so that it could return specific guidance and the commander could act on that guidance by adjusting the appropriate hyperparameters. However, doing so would require an overhaul of our tools: moving them from functions to classes. This would be a great future extension of this project and would be necessary to use this system at the enterprise level.

For now, we will simply create a mock-up of the process to show understanding.
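As a rough sketch of what moving from functions to classes could look like, the weights can live on an object that a caller adjusts between runs. `ScoringPolicy` and its methods are hypothetical and are not part of the repository's `tools` package. The default weights and thresholds are the ones used in the mock-up that follows.

```python
from dataclasses import dataclass, fields


@dataclass
class ScoringPolicy:
    """Holds the tunable weights so a caller can adjust them between runs."""

    performance_weight: float = 0.9
    risk_weight: float = 0.7
    sentiment_weight: float = 0.5
    impact_weight: float = 0.25

    def score(self, performance, risk, sentiment, impact):
        """Weighted sum; risk counts against the instrument."""
        return (
            performance * self.performance_weight
            - risk * self.risk_weight
            + sentiment * self.sentiment_weight
            + impact * self.impact_weight
        )

    def recommend(self, performance, risk, sentiment, impact):
        """Map the weighted score onto BUY / HOLD / AVOID."""
        total = self.score(performance, risk, sentiment, impact)
        if total > 0.7:
            return "BUY"
        if total > 0.4:
            return "HOLD"
        return "AVOID"

    def adjust(self, name, delta):
        """Nudge one weight by name, e.g. in response to evaluator feedback."""
        if name not in {field.name for field in fields(self)}:
            raise AttributeError(f"unknown weight: {name}")
        setattr(self, name, round(getattr(self, name) + delta, 2))


policy = ScoringPolicy()
print(policy.recommend(0.5, 0.9, 0.5, 0.9))
policy.adjust("impact_weight", 0.15)
print(policy.recommend(0.5, 0.9, 0.5, 0.9))
```

Keeping the weights as state on an object is what would let a commanding agent change them without editing the tool's code.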

### Generate analysis

In [28]:
"""Mock-up of recommendation / evaluation / optimization process"""
def make_recomendation(performance_score, risk_score, sentiment_score, impact_score, performance_weight=.9, risk_weight=.7, sentiment_weight=.5, impact_weight=.25):
    final_score = (performance_score * performance_weight) - (risk_score * risk_weight) + (sentiment_score * sentiment_weight) + (impact_score * impact_weight)
    if final_score > 0.7:
        return "BUY"
    if final_score > 0.4:
        return "HOLD"
    return "AVOID"

# Fake data for demonstration
performance_score = .5
risk_score = .9
sentiment_score = .5
impact_score = .9

recommendation = make_recomendation(performance_score, risk_score, sentiment_score, impact_score)
recommendation


'AVOID'
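To see why these inputs produce 'AVOID', it can help to break the weighted sum into its terms. The snippet below recomputes the score with the default weights from `make_recomendation`. The `signs` and `contributions` dictionaries are illustrative names only.

```python
# Default weights and the demonstration scores from the cell above.
weights = {"performance": 0.9, "risk": 0.7, "sentiment": 0.5, "impact": 0.25}
scores = {"performance": 0.5, "risk": 0.9, "sentiment": 0.5, "impact": 0.9}

# Risk is subtracted; every other term is added.
signs = {"performance": 1, "risk": -1, "sentiment": 1, "impact": 1}
contributions = {name: signs[name] * scores[name] * weights[name] for name in weights}

for name, value in contributions.items():
    print(f"{name:>11}: {value:+.3f}")

final_score = sum(contributions.values())
print(f"{'total':>11}: {final_score:+.3f}")
```

The total comes to roughly 0.295, below the 0.4 threshold for 'HOLD', because the risk term (-0.63) cancels most of the combined performance and sentiment terms.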

### Evaluate quality

In [29]:
"""Evaluate the recommendation with more specific feedback than simple pass / fail"""
def evaluate_recommendation(performance_score, risk_score, sentiment_score, impact_score, recommendation):
    if recommendation == "BUY":
        if performance_score < 0.15:
            return "Performance is too low, consider increasing your performance weight"
        if risk_score > 0.7:
            return "Risk is too high, consider increasing your risk weight"
        return "PASS"
    
    elif recommendation == "HOLD":
        if sentiment_score < -0.8:
            return "Too negative a sentiment, consider increasing your sentiment weight"
        return "PASS"
    
    elif recommendation == "AVOID":
        if impact_score >= 0.90 and sentiment_score >= 0.5:
            return "Too important to pass up, consider increasing your impact weight"
        return "PASS"
    
evaluation = evaluate_recommendation(performance_score, risk_score, sentiment_score, impact_score, recommendation)
evaluation

'Too important to pass up, consider increasing your impact weight'

### Refine using feedback

In [30]:
"""Adjust hyperparameters according to feedback from evaluation"""
impact_weight = .25
while evaluation != "PASS":
    impact_weight += 0.05
    recommendation = make_recomendation(performance_score, risk_score, sentiment_score, impact_score, impact_weight=impact_weight)
    evaluation = evaluate_recommendation(performance_score, risk_score, sentiment_score, impact_score, recommendation)
    
impact_weight

0.39999999999999997
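The result is `0.39999999999999997` rather than `0.4` because `0.05` is added repeatedly in binary floating point. One way to keep the tuned weight clean, shown here as a sketch, is to derive it from a step count and round the result. The starting weight and step size match the cell above; `STEP_DECIMALS` is an illustrative name.

```python
START, STEP, STEP_DECIMALS = 0.25, 0.05, 2

# Accumulating the step, as in the loop above, lets rounding error build up.
drifted = START
for _ in range(3):
    drifted += STEP

# Deriving the weight from a step count and rounding avoids the accumulated error.
clean = round(START + 3 * STEP, STEP_DECIMALS)

print(drifted, clean)
```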

In [None]:
"""Show new recommendation with adjusted hyperparameter"""
recommendation = make_recomendation(performance_score, risk_score, sentiment_score, impact_score, impact_weight=0.4)
recommendation

'HOLD'

In [None]:
"""Ensure new recommendation passes evaluation"""
evaluation = evaluate_recommendation(performance_score, risk_score, sentiment_score, impact_score, recommendation)
evaluation

'PASS'

# Summary
## Agent Design and Workflows
The agents used throughout this notebook are tool-calling agents. They do not have the full autonomy of code agents, but they are more predictable and have better guardrails against overly creative problem solving. In our example, each agent has access to only one tool besides the tool used to pass its answer back to the parent model and/or user, which makes each agent extremely specialized. While a more generalized approach may offer more flexibility, this specialist approach simplified the routing: the agents are called in a very linear manner, much as a human would when prompt chaining a single model with multiple tools.

## Agent Functions and Capabilities
Some of our agents are complex, while others are relatively simple. For instance, our Sentiment agent calculates a VADER sentiment score and a BERT sentiment score from articles pulled from multiple news outlets. Performance uses DeepAR, an autoregressive recurrent neural network forecasting model, along with linear regression and a short-term moving average over several time periods. On the simpler side, we have a purely random Impact agent. However, we could make each agent more robust as we discover better ways to calculate meaningful results in its area of expertise. This is one of the attractions of the agentic approach: we can gradually make each agent more of an expert in its particular slice of the problem, improving results over time.

## Evaluation and Iteration
As the code shows, the evaluation mock-up is very similar to our actual agent code. However, we needed a mock-up because our agent code is not dynamic enough to let the commanding agent adjust hyperparameters. Looking at the mock-up, if the impact weight were raised high enough for the recommendation to become 'BUY' rather than 'HOLD', the evaluator would then recommend increasing the risk weight, because the stock would be considered too risky to purchase (its risk score of 0.9 exceeds the evaluator's 0.7 limit). In this way, the system can keep tuning itself until the evaluator passes. Of course, a stopping criterion would be needed to guard against excessive (and possibly infinite) tuning iterations.
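The self-tuning idea, including a stopping criterion, might be sketched as follows. This is a simplified, self-contained stand-in for the mock-up cells, not the notebook's actual code: `recommend`, `evaluate`, `tune`, `MAX_ADJUSTMENTS`, and `STEP` are illustrative names, and the evaluator here returns the name of the weight to raise instead of a sentence.

```python
MAX_ADJUSTMENTS = 20  # stopping criterion: give up after this many adjustments
STEP = 0.05


def recommend(scores, weights):
    """Weighted sum with the same thresholds as the mock-up."""
    total = (
        scores["performance"] * weights["performance"]
        - scores["risk"] * weights["risk"]
        + scores["sentiment"] * weights["sentiment"]
        + scores["impact"] * weights["impact"]
    )
    if total > 0.7:
        return "BUY"
    if total > 0.4:
        return "HOLD"
    return "AVOID"


def evaluate(scores, recommendation):
    """Return None on a pass, otherwise the name of the weight to increase."""
    if recommendation == "BUY":
        if scores["performance"] < 0.15:
            return "performance"
        if scores["risk"] > 0.7:
            return "risk"
    elif recommendation == "HOLD":
        if scores["sentiment"] < -0.8:
            return "sentiment"
    elif recommendation == "AVOID":
        if scores["impact"] >= 0.90 and scores["sentiment"] >= 0.5:
            return "impact"
    return None


def tune(scores, weights):
    """Adjust weights until the evaluator passes or the adjustment budget runs out."""
    weights = dict(weights)  # work on a copy so the caller's weights are untouched
    adjustments = 0
    while True:
        recommendation = recommend(scores, weights)
        weight_to_raise = evaluate(scores, recommendation)
        if weight_to_raise is None:
            return recommendation, weights, adjustments
        if adjustments == MAX_ADJUSTMENTS:
            raise RuntimeError(f"no passing recommendation after {MAX_ADJUSTMENTS} adjustments")
        weights[weight_to_raise] = round(weights[weight_to_raise] + STEP, 2)
        adjustments += 1


scores = {"performance": 0.5, "risk": 0.9, "sentiment": 0.5, "impact": 0.9}
weights = {"performance": 0.9, "risk": 0.7, "sentiment": 0.5, "impact": 0.25}
print(tune(scores, weights))
```

With the demonstration scores, this returns 'HOLD' after three adjustments with the impact weight at 0.4, matching the mock-up. The adjustment budget matters because some feedback, such as raising the performance weight after a 'BUY', pushes the score further in the same direction and would never converge on its own.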