Test `citations` in Anthropic

The API needs the source documents it should cite from. Supported document types:
- PDF
- plain text
- custom content
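
As a rough sketch of how these three document types might be shaped in a request, the helpers below build one content block each. The helper names (`text_document`, `pdf_document`, `custom_document`) are hypothetical and not part of the SDK. The `source` shapes reflect my understanding of the Messages API and should be checked against the current documentation before use.

```python
import base64


def text_document(text, title=None):
    """Plain-text document block (the shape used later in this notebook)."""
    block = {
        "type": "document",
        "source": {"type": "text", "media_type": "text/plain", "data": text},
        "citations": {"enabled": True},
    }
    if title:
        block["title"] = title
    return block


def pdf_document(pdf_bytes, title=None):
    """PDF document block; assumes the PDF is sent as base64-encoded bytes."""
    encoded = base64.standard_b64encode(pdf_bytes).decode("utf-8")
    block = {
        "type": "document",
        "source": {"type": "base64", "media_type": "application/pdf", "data": encoded},
        "citations": {"enabled": True},
    }
    if title:
        block["title"] = title
    return block


def custom_document(chunks, title=None):
    """Custom-content document block; assumes each chunk is sent as its own text block."""
    block = {
        "type": "document",
        "source": {
            "type": "content",
            "content": [{"type": "text", "text": chunk} for chunk in chunks],
        },
        "citations": {"enabled": True},
    }
    if title:
        block["title"] = title
    return block
```

Each helper only builds a dictionary, so it can be inspected locally before any request is sent.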

In [50]:
from IPython.display import display, Markdown

import json
import wikipedia

from anthropic import Anthropic
client = Anthropic()

In [26]:
anthropic_model_name = "claude-3-5-sonnet-20241022"
temperature = 0.0
max_tokens = 8192

system_message = "You are a professional QA assistant that can summarize documents and answer questions."

In [35]:
def call_anthropic_API(messages):
    
    response = client.messages.create(
        model=anthropic_model_name,
        max_tokens=max_tokens,
        temperature=temperature,
        system=system_message,
        messages=messages
    )
    
    return response

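def count_anthropic_tokens(text):
    """Return the token count for `text` as a plain dict."""
    response = client.messages.count_tokens(
        model=anthropic_model_name,
        system="",
        messages=[{
            "role": "user",
            "content": text
        }],
    )

    # model_dump() converts the pydantic response to a dict directly;
    # eval() on a JSON string is unsafe and fails on literals such as null or true
    return response.model_dump()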

Test connection

In [9]:
user_message = "Concisely, explain the difference between a falcon and a raven."

test_msg = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": user_message
            }
        ]
    }
]

response = call_anthropic_API(test_msg)

print(f'Usage: {response.usage}')
print('Response:\n')
display(Markdown(response.content[0].text))

Usage: Usage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=39, output_tokens=175)
Response:



Falcons and ravens differ in several key ways:

Falcons are birds of prey (raptors) with pointed wings, hooked beaks, and powerful talons used for hunting. They're known for their incredible speed and agility in flight.

Ravens are corvids (part of the crow family) with broader wings, larger bodies, and straight beaks. They're intelligent scavengers and omnivores, known for problem-solving abilities and complex social behaviors.

The main differences are in their:
- Diet (falcons hunt live prey; ravens eat almost anything)
- Beak shape (hooked vs straight)
- Wing shape (pointed vs broad)
- Size (ravens are generally larger)
- Hunting style (falcons are predators; ravens are opportunistic feeders)

### Test `citations`

1. For plain text, we will use a Wikipedia page
2. For PDF, we will upload a more complex financial report
3. For custom content, we will use a structured markdown / text source

#### 1. Plain text

First, save some content to a text file

In [37]:
output_path = "../output/anthropic-citations/tesla-wiki.txt"

tesla_wiki_page = wikipedia.page(title="Tesla, Inc.")

# Write the content to the file
with open(output_path, "w", encoding="utf-8") as f:
    f.write(f"Title: {tesla_wiki_page.title}\n\n")
    f.write(f"URL: {tesla_wiki_page.url}\n\n")
    f.write("Content:\n")
    f.write(tesla_wiki_page.content)

print(f"Content saved to: {output_path}")

Content saved to: ../output/anthropic-citations/tesla-wiki.txt


In [38]:
# Read the file
with open(output_path, "r", encoding="utf-8") as f:
    plain_txt_content = f.read()
    
# Print first 500 characters
print("First 500 characters of the file:")
print("-" * 50)
print(plain_txt_content[:500])
print("-" * 50)
print(f"\nTotal characters in file: {len(plain_txt_content)}")
print(f"Total tokens: {count_anthropic_tokens(plain_txt_content)['input_tokens']}")

First 500 characters of the file:
--------------------------------------------------
Title: Tesla, Inc.

URL: https://en.wikipedia.org/wiki/Tesla,_Inc.

Content:
Tesla, Inc. ( TESS-lə or  TEZ-lə) is an American multinational automotive and clean energy company. Headquartered in Austin, Texas, it designs, manufactures and sells battery electric vehicles (BEVs), stationary battery energy storage devices from home to grid-scale, solar panels and solar shingles, and related products and services.
Tesla was incorporated in July 2003 by Martin Eberhard and Marc Tarpenning as Tesla Mot
--------------------------------------------------

Total characters in file: 82749
Total tokens: 19382


Call the API with citations enabled, passing the document

In [53]:
user_message = "Briefly explain what difficulties Tesla faced between 2010 and 2016"

test_msg = [
    {
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "text",
                    "media_type": "text/plain",
                    "data": plain_txt_content
                },
                "title": "Tesla Inc.", # optional
                "context": "Wikipedia page about Tesla Inc.", # optional
                "citations": {"enabled": True}
            },
            {
                "type": "text",
                "text": user_message
            }
        ]
    }
]

response_with_citations = call_anthropic_API(test_msg)

In [54]:
print(f'Usage: {response_with_citations.usage}')

Usage: Usage(cache_creation_input_tokens=0, cache_read_input_tokens=0, input_tokens=29523, output_tokens=458)


A response with citations has a slightly different structure: `content` is no longer a single text block but a list of text blocks, some of which carry a `citations` list.
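
To make that structure concrete, here is a minimal sketch that walks a list of blocks and separates the full answer text from the parts that carry citations. It uses `SimpleNamespace` stand-ins instead of a live API response, so the sample data is made up. Only the attribute names (`type`, `text`, `citations`, `cited_text`) mirror the response printed in this notebook, and `split_blocks` is a hypothetical helper name.

```python
from types import SimpleNamespace


def split_blocks(blocks):
    """Return (full_text, cited), where cited is a list of (text, [cited_text, ...])."""
    full_text = ""
    cited = []
    for block in blocks:
        if block.type != "text":
            continue  # skip any non-text blocks
        full_text += block.text
        # citations is None (or absent) on blocks without supporting quotes
        citations = getattr(block, "citations", None) or []
        if citations:
            cited.append((block.text, [c.cited_text for c in citations]))
    return full_text, cited


# Stand-in blocks with made-up content, shaped like the response's text blocks
sample_blocks = [
    SimpleNamespace(type="text", text="Summary: ", citations=None),
    SimpleNamespace(
        type="text",
        text="The sky is blue.",
        citations=[SimpleNamespace(cited_text="The sky appears blue.")],
    ),
]

full_text, cited = split_blocks(sample_blocks)
print(full_text)
print(cited)
```

In the notebook, the same function could be called as `split_blocks(response_with_citations.content)`.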

In [55]:
len(response_with_citations.content)

10

In [56]:
response_with_citations.content

[TextBlock(citations=None, text='Based on the documents, here are the key difficulties Tesla faced between 2010-2016:\n\n1. Production and Manufacturing Challenges:\n', type='text'),
 TextBlock(citations=[CitationCharLocation(cited_text='To speed up production and control costs, Tesla invested heavily in robotics and automation to assemble the Model 3, but the robotics actually slowed the production of the vehicles. This led to significant delays and production problems, a period which the company described as "production hell". ', document_index=0, document_title='Tesla Inc.', end_char_index=8056, start_char_index=7759, type='char_location')], text='When trying to produce the Model 3, Tesla invested heavily in robotics and automation for assembly, but this actually slowed down production and led to significant delays and problems, a period which the company described as "production hell."', type='text'),
 TextBlock(citations=None, text='\n\n2. Financial Pressures:\n', type='text'),
 T

In [58]:
print(response_with_citations.content[1].text)

When trying to produce the Model 3, Tesla invested heavily in robotics and automation for assembly, but this actually slowed down production and led to significant delays and problems, a period which the company described as "production hell."


In [57]:
response_with_citations.content[1].citations

[CitationCharLocation(cited_text='To speed up production and control costs, Tesla invested heavily in robotics and automation to assemble the Model 3, but the robotics actually slowed the production of the vehicles. This led to significant delays and production problems, a period which the company described as "production hell". ', document_index=0, document_title='Tesla Inc.', end_char_index=8056, start_char_index=7759, type='char_location')]
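
One way to sanity-check a `char_location` citation is to slice the source document with its indices and compare the result with `cited_text`. The sketch below assumes `start_char_index` is a 0-based offset into the document text and `end_char_index` is exclusive. That assumption is consistent with the citation above, where `end_char_index - start_char_index` is 297, the length of `cited_text`. The helper name and the sample data are hypothetical.

```python
from types import SimpleNamespace


def citation_matches_source(document_text, citation):
    """True if slicing the document with the citation's indices yields cited_text."""
    start = citation.start_char_index
    end = citation.end_char_index
    return document_text[start:end] == citation.cited_text


# Made-up document and citation, shaped like a char_location citation
doc = "Alpha sentence. Beta sentence. Gamma sentence."
quote = "Beta sentence. "
start = doc.index(quote)

good_citation = SimpleNamespace(
    cited_text=quote,
    start_char_index=start,
    end_char_index=start + len(quote),
)
shifted_citation = SimpleNamespace(
    cited_text=quote,
    start_char_index=start + 1,
    end_char_index=start + 1 + len(quote),
)

print(citation_matches_source(doc, good_citation))
print(citation_matches_source(doc, shifted_citation))
```

Against the real response, the call would look like `citation_matches_source(plain_txt_content, response_with_citations.content[1].citations[0])`. Whether the indices always line up with Python string indexing (for example with unusual Unicode content) is not verified here.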

Raw response

In [59]:
print("\n" + "="*80 + "\nRaw response:\n" + "="*80)
raw_response = {
    "blocks": []
}

for content in response_with_citations.content:
    if content.type == "text":
        block = {
            "text": content.text,
        }
        if hasattr(content, 'citations') and content.citations:
            block["citations"] = [
                {
                    "type": c.type,
                    "cited_text": c.cited_text,
                    "document_index": c.document_index,
                    "document_title": c.document_title,
                    "start_char_index": c.start_char_index,
                    "end_char_index": c.end_char_index
                } for c in content.citations
            ]
        raw_response["blocks"].append(block)

print(json.dumps(raw_response, indent=2))


Raw response:
{
  "blocks": [
    {
      "text": "Based on the documents, here are the key difficulties Tesla faced between 2010-2016:\n\n1. Production and Manufacturing Challenges:\n"
    },
    {
      "text": "When trying to produce the Model 3, Tesla invested heavily in robotics and automation for assembly, but this actually slowed down production and led to significant delays and problems, a period which the company described as \"production hell.\"",
      "citations": [
        {
          "type": "char_location",
          "cited_text": "To speed up production and control costs, Tesla invested heavily in robotics and automation to assemble the Model 3, but the robotics actually slowed the production of the vehicles. This led to significant delays and production problems, a period which the company described as \"production hell\". ",
          "document_index": 0,
          "document_title": "Tesla Inc.",
          "start_char_index": 7759,
          "end_char_index": 8056
  

In [63]:
output_data = {
    "user_query": user_message,  
    "raw_response": raw_response  
}

# Save to JSON file
output_file = "../output/anthropic-citations/response_with_citations-json.json"

with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=2, ensure_ascii=False)

print(f"Data saved to: {output_file}")

Data saved to: ../output/anthropic-citations/response_with_citations-json.json


In [64]:
def visualize_citations(response):
    """
    Takes a response object and returns a string with numbered citations.
    Example output: "here is the plain text answer [1][2] here is some more text [3]"
    with a list of citations below.
    """
    # Dictionary to store unique citations
    citations_dict = {}
    citation_counter = 1
    
    # Final formatted text
    formatted_text = ""
    citations_list = []
    
    for content in response.content:
        if content.type == "text":
            text = content.text
            if hasattr(content, 'citations') and content.citations:
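                # Sort citations by their position in the source document;
                # start_char_index only exists on char_location citations,
                # so fall back to 0 for other citation types
                sorted_citations = sorted(content.citations,
                                         key=lambda x: getattr(x, "start_char_index", 0))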
                
                # Process each citation
                for citation in sorted_citations:
                    doc_title = citation.document_title
                    cited_text = citation.cited_text.replace('\n', ' ').replace('\r', ' ')
                    # Remove any multiple spaces that might have been created
                    cited_text = ' '.join(cited_text.split())
                    
                    # Create a unique key for this citation
                    citation_key = f"{doc_title}:{cited_text}"
                    
                    # If this is a new citation, add it to our dictionary
                    if citation_key not in citations_dict:
                        citations_dict[citation_key] = citation_counter
                        citations_list.append(f"[{citation_counter}] \"{cited_text}\" found in \"{doc_title}\"")
                        citation_counter += 1
                    
                    # Add the citation number to the text
                    citation_num = citations_dict[citation_key]
                    text += f" [{citation_num}]"
            
            formatted_text += text
    
    # Combine the formatted text with the citations list
    final_output = formatted_text + "\n\n" + "\n".join(citations_list)
    return final_output

formatted_response = visualize_citations(response_with_citations)

output_path = "../output/anthropic-citations/response_with_citations-txt.txt"

# Write the content to the file
with open(output_path, "w", encoding="utf-8") as f:
    f.write(formatted_response)

print(f"Content saved to: {output_path}")

print(formatted_response)

Content saved to: ../output/anthropic-citations/response_with_citations-txt.txt
Based on the documents, here are the key difficulties Tesla faced between 2010-2016:

1. Production and Manufacturing Challenges:
When trying to produce the Model 3, Tesla invested heavily in robotics and automation for assembly, but this actually slowed down production and led to significant delays and problems, a period which the company described as "production hell." [1]

2. Financial Pressures:
The production difficulties put significant financial pressure on Tesla, and during this time it became one of the most shorted companies in the stock market. [2]

3. Battery and Technical Issues:
In 2013, there were safety concerns when a Model S caught fire after hitting metal debris on a highway in Kent, Washington. Tesla confirmed the fire began in the battery pack and was caused by impact. As a result, Tesla had to extend its vehicle warranty to cover fire damage. By March 2014, Tesla had to implement addit