## Blog Authorship Corpus Exercise (Part 2)

This notebook uses the Kaggle blog dataset prepared in the other notebook to assess whether Claude can identify authorship.

In [1]:
!pip install anthropic

Collecting anthropic
  Downloading anthropic-0.49.0-py3-none-any.whl.metadata (24 kB)
Collecting anyio<5,>=3.5.0 (from anthropic)
  Downloading anyio-4.9.0-py3-none-any.whl.metadata (4.7 kB)
Collecting distro<2,>=1.7.0 (from anthropic)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx<1,>=0.23.0 (from anthropic)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from anthropic)
  Downloading jiter-0.9.0-cp312-cp312-macosx_10_12_x86_64.whl.metadata (5.2 kB)
Collecting sniffio (from anthropic)
  Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->anthropic)
  Downloading httpcore-1.0.7-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->anthropic)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading anthropic-0.49.0-py3-none-any.whl (243 kB)
Downloading anyio-4.9.0-py3-none-any.whl (100 kB)
Downloa

In [2]:
import os
from credentials import get_credentials_claude
from anthropic import Anthropic
import pickle

In [4]:
API_KEY = get_credentials_claude()
client = Anthropic(api_key=API_KEY)

In [27]:
# Create a completion function that will be used to query the model
def get_completion(prompt, max_tokens=1000):
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.content[0].text

In [5]:
#Load the data
with open('train_set.pkl', 'rb') as f:
    train_set = pickle.load(f)

with open('test_set.pkl', 'rb') as f:
    test_set = pickle.load(f)

In [8]:
train_set.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_count
0,449628,male,34,indUnk,Aries,"05,June,2003",urlLink A Day in the Country 20...,4221
1,449628,male,34,indUnk,Aries,"20,February,2003",urlLink DE Japan : Resources : Career...,4221
2,734562,female,24,Arts,Libra,"03,August,2004",You ain't fat! You ain't nothin'! ...,2301
3,734562,female,24,Arts,Libra,"03,August,2004",so no one was amused by the old JLS...,2301
4,589736,male,35,Technology,Aries,"05,August,2004",i'm sorry that i didn't let the gro...,2294


### So I need to create a prompt that connects the "text" column with the "id" column in the train set, then presents a random text from the test set and has the model guess which id it came from.

In [9]:
# Get pairs of ids and texts to use as variables in the prompt
train_pairs = train_set[['id', 'text']].to_dict(orient='records')


In [16]:
# Get a list of author ids
author_ids = train_set['id'].unique()
len(author_ids)


10

In [23]:
train_pairs[2]['id'] == train_pairs[3]['id']

True
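
The comparison above confirms only that rows 2 and 3 share an author. A minimal sketch, assuming `train_pairs` is a list of `{'id', 'text'}` dicts as produced earlier, groups the texts by author so the two-texts-per-author layout can be checked for every id at once. The `group_texts_by_author` helper and the `toy_pairs` data are hypothetical and only stand in for the real corpus.

```python
from collections import defaultdict


def group_texts_by_author(pairs):
    """Map each author id to the list of its texts, preserving row order."""
    grouped = defaultdict(list)
    for row in pairs:
        grouped[row["id"]].append(row["text"])
    return dict(grouped)


# Toy stand-in for train_pairs (hypothetical data, not from the corpus)
toy_pairs = [
    {"id": 1, "text": "first post by author 1"},
    {"id": 1, "text": "second post by author 1"},
    {"id": 2, "text": "first post by author 2"},
    {"id": 2, "text": "second post by author 2"},
]

texts_by_author = group_texts_by_author(toy_pairs)

# True only if every author contributes exactly two texts
all_have_two = all(len(texts) == 2 for texts in texts_by_author.values())
print(all_have_two)
```

Running the same helper on the real `train_pairs` would show whether the consecutive-index assumption used in the prompt holds for all 10 authors.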

In [25]:
# First select a random row from test_set
random_row = test_set.sample(n=1, random_state=42)  # random_state for reproducibility
random_text = random_row['text'].iloc[0]
random_id = random_row['id'].iloc[0]  # We'll keep this to check if the model's prediction is correct

In [12]:
# First version of the prompt
prompt = f"""
You are an authorship analysis expert. You will be given examples of two texts from each of 10 different authors. \
You should read the samples carefully and identify linguistic features that distinguish one author from the others. \
Then you will be given a new text, and your job is to identify which author it most likely belongs to.

Here are the authors and the texts:
<author_samples>
<author_id>
{train_pairs[0]['id']}
</author_id>
<texts>
{train_pairs[0]['text']}
{train_pairs[1]['text']}
</texts>

<author_id>
{train_pairs[2]['id']}
</author_id>
<texts>
{train_pairs[2]['text']}
{train_pairs[3]['text']}
</texts>

<author_id>
{train_pairs[4]['id']}
</author_id>
<texts>
{train_pairs[4]['text']}
{train_pairs[5]['text']}
</texts>

<author_id>
{train_pairs[6]['id']}
</author_id>
<texts>
{train_pairs[6]['text']}
{train_pairs[7]['text']}
</texts>

<author_id>
{train_pairs[8]['id']}
</author_id>
<texts>
{train_pairs[8]['text']}
{train_pairs[9]['text']}
</texts>

<author_id>
{train_pairs[10]['id']}
</author_id>
<texts>
{train_pairs[10]['text']}
{train_pairs[11]['text']}
</texts>

<author_id>
{train_pairs[12]['id']}
</author_id>
<texts>
{train_pairs[12]['text']}
{train_pairs[13]['text']}
</texts>

<author_id>
{train_pairs[14]['id']}
</author_id>
<texts>
{train_pairs[14]['text']}
{train_pairs[15]['text']}
</texts>

<author_id>
{train_pairs[16]['id']}
</author_id>
<texts>
{train_pairs[16]['text']}
{train_pairs[17]['text']}
</texts>

<author_id>
{train_pairs[18]['id']}
</author_id>
<texts>
{train_pairs[18]['text']}
{train_pairs[19]['text']}
</texts>

</author_samples>

Now you will be given a new text, and your job is to identify which author it most likely belongs to.

<new_text>
{random_text}
</new_text>

Present your answer in the following format:
<answer>
Author ID: [Your prediction]
</answer>
"""




In [28]:
#Improved version (with LLM help)

prompt = f"""
You are an expert in linguistic analysis and authorship attribution. Your task is to analyze writing styles and identify the author of a given text.

You will be provided with:
1. Training samples from 10 different authors (2 texts per author)
2. A new text whose author you need to identify

Guidelines for analysis:
- Pay attention to writing style elements like:
  * Vocabulary choice and complexity
  * Sentence structure and length
  * Punctuation patterns
  * Common phrases or expressions
  * Topic preferences
  * Tone and voice
  * Grammar patterns
- Consider both obvious and subtle patterns
- Look for consistent features across an author's texts
- Note any unique or distinctive elements

Here are the training samples:

<author_samples>
<author_id>
{train_pairs[0]['id']}
</author_id>
<texts>
{train_pairs[0]['text']}
{train_pairs[1]['text']}
</texts>

<author_id>
{train_pairs[2]['id']}
</author_id>
<texts>
{train_pairs[2]['text']}
{train_pairs[3]['text']}
</texts>

<author_id>
{train_pairs[4]['id']}
</author_id>
<texts>
{train_pairs[4]['text']}
{train_pairs[5]['text']}
</texts>

<author_id>
{train_pairs[6]['id']}
</author_id>
<texts>
{train_pairs[6]['text']}
{train_pairs[7]['text']}
</texts>

<author_id>
{train_pairs[8]['id']}
</author_id>
<texts>
{train_pairs[8]['text']}
{train_pairs[9]['text']}
</texts>

<author_id>
{train_pairs[10]['id']}
</author_id>
<texts>
{train_pairs[10]['text']}
{train_pairs[11]['text']}
</texts>

<author_id>
{train_pairs[12]['id']}
</author_id>
<texts>
{train_pairs[12]['text']}
{train_pairs[13]['text']}
</texts>

<author_id>
{train_pairs[14]['id']}
</author_id>
<texts>
{train_pairs[14]['text']}
{train_pairs[15]['text']}
</texts>

<author_id>
{train_pairs[16]['id']}
</author_id>
<texts>
{train_pairs[16]['text']}
{train_pairs[17]['text']}
</texts>

<author_id>
{train_pairs[18]['id']}
</author_id>
<texts>
{train_pairs[18]['text']}
{train_pairs[19]['text']}
</texts>
</author_samples>

Now, analyze this new text and identify which author it most likely belongs to. Provide your reasoning by briefly explaining which linguistic features led you to your conclusion.

<new_text>
{random_text}
</new_text>

Please provide your answer in this format:
Author ID: [predicted_id]
Reasoning: [explanation of your analysis]
"""
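
The hand-indexed placeholders in the prompt could also be generated in a loop, which keeps the tags consistent and adapts if the number of authors or texts changes. This is a rough sketch, assuming the same `<author_id>`/`<texts>` layout; `build_author_samples` and the toy data are hypothetical names introduced here for illustration.

```python
def build_author_samples(pairs):
    """Render {'id', 'text'} rows as an <author_samples> block."""
    # Group texts by author id, keeping the order in which authors appear
    texts_by_author = {}
    for row in pairs:
        texts_by_author.setdefault(row["id"], []).append(row["text"])

    blocks = []
    for author_id, texts in texts_by_author.items():
        joined = "\n".join(texts)
        blocks.append(
            f"<author_id>\n{author_id}\n</author_id>\n<texts>\n{joined}\n</texts>"
        )
    return "<author_samples>\n" + "\n\n".join(blocks) + "\n</author_samples>"


# Toy stand-in for train_pairs (hypothetical data, not from the corpus)
toy_pairs = [
    {"id": 1, "text": "first post by author 1"},
    {"id": 1, "text": "second post by author 1"},
    {"id": 2, "text": "first post by author 2"},
    {"id": 2, "text": "second post by author 2"},
]

samples_block = build_author_samples(toy_pairs)
print(samples_block)
```

The resulting string could be interpolated into the prompt as a single variable in place of the repeated `train_pairs[i]` blocks.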

In [29]:
first_prediction = get_completion(prompt)
print(first_prediction)

Author ID: 303162

Reasoning:

The writing style in this new text most closely matches that of author 303162. Here's why:

1. Sentence structure: The author uses a mix of longer, complex sentences and shorter, punchy ones. This is similar to the sample from 303162, which also shows varied sentence lengths.

2. Stream of consciousness style: The text flows in a somewhat rambling, introspective manner, similar to 303162's style in their samples. There's a sense of the author thinking out loud and exploring their thoughts as they write.

3. Use of parenthetical asides: The author uses parentheses to add extra thoughts, which is also seen in 303162's writing ("(!) Bisi.").

4. Casual tone: The writing has an informal, conversational tone, using phrases like "One more thing:" and "Silly chronicles of life." This matches the casual style seen in 303162's samples.

5. Personal anecdotes: The text focuses on a personal experience (a dream), which aligns with 303162's tendency to share personal

In [30]:
# Check the model's prediction against the true author id
random_id

np.int64(1107146)
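
The model predicted 303162 while the true id is 1107146, so this attribution is wrong. Comparing by eye does not scale past a single sample. Below is a minimal sketch of an automated check, assuming the response keeps the `Author ID: ...` line requested in the prompt. `extract_author_id` is a hypothetical helper, and the sample response is shortened from the output above.

```python
import re


def extract_author_id(response_text):
    """Return the integer following 'Author ID:', or None if it is absent."""
    match = re.search(r"Author ID:\s*(\d+)", response_text)
    return int(match.group(1)) if match else None


# Shortened copy of the model output shown above
sample_response = "Author ID: 303162\n\nReasoning:\n..."
true_id = 1107146  # value of random_id for this test row

predicted_id = extract_author_id(sample_response)
is_correct = predicted_id == true_id
print(predicted_id, is_correct)
```

Looping this check over several test rows would give an accuracy estimate rather than a single hit or miss.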