# Limitations of LLM Reasoning

In [1]:
#%pip install --upgrade --quiet pydantic-ai-slim[anthropic,openai]

In [2]:
GEMINI="gemini-2.0-flash"
OPENAI="gpt-4o-mini"
CLAUDE="claude-3-7-sonnet-latest"

import os
from dotenv import load_dotenv
load_dotenv("../keys.env")
# Use .get() so a missing key triggers the helpful assertion message instead of a KeyError.
assert os.environ.get("GEMINI_API_KEY", "").startswith("AI"),\
       "Please specify the GEMINI_API_KEY access token in the keys.env file"
assert os.environ.get("ANTHROPIC_API_KEY", "").startswith("sk"),\
       "Please specify the ANTHROPIC_API_KEY access token in the keys.env file"
assert os.environ.get("OPENAI_API_KEY", "").startswith("sk"),\
       "Please specify the OPENAI_API_KEY access token in the keys.env file"

In [3]:
# Needed in a Jupyter environment. See: https://ai.pydantic.dev/troubleshooting/
import nest_asyncio
nest_asyncio.apply()

In [4]:
def zero_shot(model_id: str, prompt: str) -> str:
    """Send a single prompt to the given model, with no examples or tools."""
    from pydantic_ai import Agent
    agent = Agent(model_id)
    result = agent.run_sync(prompt)
    return result.data

## Prime number

In [6]:
print(zero_shot(GEMINI, "List the prime numbers between 100 and 110"))

The prime numbers between 100 and 110 are:

*   **101**
*   **103**
*   **107**
*   **109**
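
An answer like this is easy to verify without an LLM. The following is a minimal sketch using trial division; the helper names `is_prime` and `primes_between` are illustrative and not part of the notebook above.

```python
def is_prime(n: int) -> bool:
    """Return True if n is prime, using trial division up to sqrt(n)."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def primes_between(low: int, high: int) -> list[int]:
    """List the primes in the inclusive range [low, high]."""
    return [n for n in range(low, high + 1) if is_prime(n)]


print(primes_between(100, 110))  # [101, 103, 107, 109]
```

Here the deterministic check agrees with the model's list.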


## Apartment size

In [7]:
print(zero_shot(OPENAI, "How many square feet is an apartment that is 84 sq meters?"))

To convert square meters to square feet, you can use the conversion factor that 1 square meter is approximately 10.7639 square feet.

So, to convert 84 square meters to square feet:

\[ 
84 \, \text{sq meters} \times 10.7639 \, \text{sq feet/sq meter} \approx 903.20 \, \text{sq feet} 
\]

Therefore, an apartment that is 84 square meters is approximately 903.20 square feet.


In [8]:
84*10.7639

904.1676

The approach is correct, but the arithmetic is hallucinated: the model reports 903.20 sq feet, whereas 84 × 10.7639 is about 904.17 sq feet.
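
One way to catch this kind of error is to recompute the number and compare it with the model's claim. The sketch below assumes the conversion factor quoted in the response (10.7639) and an arbitrary tolerance of 0.01 sq feet; `sq_m_to_sq_ft` and `claim_matches` are hypothetical helper names.

```python
SQ_FT_PER_SQ_M = 10.7639  # factor quoted in the model's response


def sq_m_to_sq_ft(area_sq_m: float) -> float:
    """Convert an area from square meters to square feet."""
    return area_sq_m * SQ_FT_PER_SQ_M


def claim_matches(claimed: float, expected: float, tolerance: float = 0.01) -> bool:
    """Check a claimed value against a recomputed one (tolerance is an assumption)."""
    return abs(claimed - expected) <= tolerance


expected = sq_m_to_sq_ft(84)
print(round(expected, 2), claim_matches(903.20, expected))  # 904.17 False
```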

## Bridge maxim

In [11]:
for model in [GEMINI, OPENAI, CLAUDE]:
    print(model, ":\n",
          zero_shot(model, 'In bridge, what does the maxim "eight ever, nine never" mean? Respond in 3-5 sentences.'),
          "\n\n")

gemini-2.0-flash :
 The maxim "eight ever, nine never" is a guideline for playing suits in bridge, particularly when trying to win tricks. It refers to the number of cards you and your partner hold in a particular suit. If you have a combined total of eight cards in a suit, you should generally try to finesse to make extra tricks in that suit. However, if you have a combined total of nine cards, you should generally play for the drop, as finessing is less likely to succeed and can give the opponents extra tricks.
 


gpt-4o-mini :
 In bridge, the maxim "eight ever, nine never" refers to the principles of determining the optimal number of cards to play when considering whether to bid or support a suit. Specifically, if you have eight cards in a particular suit, you should be willing to support that suit. However, if you have nine cards, it's usually better to not simply rely on that long suit for game consideration, as it may be more beneficial to explore other options for bidding. This

## Suit play based on maxim

The expert line is to play the Ace, and then the King if the Ten doesn’t fall on the right. If the Ten does fall on the right, come to hand in another suit and take the finesse.

An intermediate player following the maxim would cash the Ace and the King, because that offers slightly better odds than a first-round finesse.

Either answer would be acceptable.
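
The odds behind these lines can be estimated by enumerating how the four missing cards (Queen, Ten, and two small cards) divide between the defenders. The sketch below is a simplified model, not a definitive analysis. It assumes AKJxx sits in dummy, so the finesse works against West, and "the right" means East. It also assumes each defender holds 13 unseen cards, that East drops the Ten only from a singleton Ten or, half the time, from Queen-Ten doubleton, and that defenders never falsecard otherwise. All function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

MISSING = ("Q", "T", "x1", "x2")  # the four cards the defenders hold in the suit


def layouts():
    """Yield (west, east, probability) for all 16 splits of the missing cards."""
    for k in range(len(MISSING) + 1):
        # West holds k of the 4 cards plus 13 - k of the 22 unseen cards in other suits.
        p = Fraction(comb(22, 13 - k), comb(26, 13))
        for west in combinations(MISSING, k):
            yield set(west), set(MISSING) - set(west), p


def drop(west, east):
    """Cash the Ace and King: works when the Queen is singleton or doubleton."""
    queen_hand = west if "Q" in west else east
    return Fraction(int(len(queen_hand) <= 2))


def first_round_finesse(west, east):
    """Finesse the Jack at once: works when West has the Queen, at most third."""
    return Fraction(int("Q" in west and len(west) <= 3))


def expert(west, east):
    """Cash the Ace, finesse only if East drops the Ten, otherwise cash the King."""
    if east == {"T"}:
        return Fraction(1)     # the Ten falls and the finesse picks up West's Qxx
    if east == {"Q", "T"}:
        return Fraction(1, 2)  # assumed: East drops the Ten half the time; the finesse then loses
    return drop(west, east)


def success(line):
    """Total probability of losing no trick with the given line."""
    return sum(p * line(west, east) for west, east, p in layouts())


for line in (drop, first_round_finesse, expert):
    print(f"{line.__name__}: {float(success(line)):.1%}")
```

Under these assumptions the expert line comes out ahead of cashing the Ace and King, which in turn beats the first-round finesse.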

In [13]:
for model in [GEMINI, OPENAI, CLAUDE]:
    print(model, ":\n",
          zero_shot(model, 'In bridge, holding AKJxx opposite four small, how should you play the suit for no losers?'),
          "\n\n")

gemini-2.0-flash :
 The best way to play this suit combination (AKJxx opposite xxxx) in bridge, aiming for no losers, depends on the lead and the number of remaining cards in the suit. Here's a breakdown of the different scenarios and how to play them:

**Understanding the Situation:**

*   **You have:** AKJxx in one hand, and four small cards (xxxx) in the other. This is a fairly strong suit holding.
*   **Goal:** To avoid losing any tricks in this suit.

**Scenarios and How to Play:**

**1. Opposition Leads the Suit:**

*   **If the opponent leads *low* to your hand:**
    *   **Play the Jack (J).** This forces the opponent to win the trick.
    *   *   **If the opponent wins with the Queen (Q) or Ten (T) then you can win all the rest of the tricks with your Ace and King**
*   **If the opponent leads the Ace, King, Queen or Ten.**
    *   **Play a small card and hope one of the other players has the Queen or the Ten. You can then win all the remaining tricks with the Ace, King and Ja

All three models answered incorrectly. To wit:

Gemini suggests that you "Lead low to your hand with four small." This is badly wrong; not even a beginning card player would make this mistake.

GPT gets the right intermediate line (cash the Ace and King), but for the wrong reasons. It doesn't recognize that the line works because the opponents' cards are likely to be split 2-2.

Claude seems to pick the expert line (its paragraph about the line giving the best chance against most distributions correctly describes the expert line), but the line of play it describes is wrong. The point of the expert line is to take the finesse only if the Ten appears on the right.

## Bridge maxim and line of play

What if we ask the conversational interfaces, which have access to tools such as web search?

Here are links to the relevant sessions:

ChatGPT: https://chatgpt.com/share/67f34e98-d930-8006-aa3b-e47d3b67554f

Gemini: https://g.co/gemini/share/e05094727453

Claude: https://claude.ai/share/fc1ce42e-8db4-46f3-a1b0-2be9fd13b14f

All three get the maxim right and the line of play wrong.