## Blog Authorship Corpus Exercise (Part 2)

This notebook will use the Kaggle blog dataset prepared in other notebook to assess whether Claude can identify authorship

In [1]:
!pip install anthropic

Collecting anthropic
  Downloading anthropic-0.49.0-py3-none-any.whl.metadata (24 kB)
Collecting anyio<5,>=3.5.0 (from anthropic)
  Downloading anyio-4.9.0-py3-none-any.whl.metadata (4.7 kB)
Collecting distro<2,>=1.7.0 (from anthropic)
  Downloading distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting httpx<1,>=0.23.0 (from anthropic)
  Downloading httpx-0.28.1-py3-none-any.whl.metadata (7.1 kB)
Collecting jiter<1,>=0.4.0 (from anthropic)
  Downloading jiter-0.9.0-cp312-cp312-macosx_10_12_x86_64.whl.metadata (5.2 kB)
Collecting sniffio (from anthropic)
  Downloading sniffio-1.3.1-py3-none-any.whl.metadata (3.9 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->anthropic)
  Downloading httpcore-1.0.7-py3-none-any.whl.metadata (21 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->anthropic)
  Downloading h11-0.14.0-py3-none-any.whl.metadata (8.2 kB)
Downloading anthropic-0.49.0-py3-none-any.whl (243 kB)
Downloading anyio-4.9.0-py3-none-any.whl (100 kB)
Downloa

In [27]:
import os
from credentials import get_credentials_claude
from anthropic import Anthropic
import pickle
import pandas as pd

In [2]:
API_KEY = get_credentials_claude()
client = Anthropic(api_key=API_KEY)

In [3]:
# Create a completion function that will be used to query the model
def get_completion(prompt, max_tokens=1000):
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,)
    return response.content[0].text

In [4]:
#Load the data
with open('train_set.pkl', 'rb') as f:
    train_set = pickle.load(f)

with open('test_set.pkl', 'rb') as f:
    test_set = pickle.load(f)

In [8]:
train_set.head()

Unnamed: 0,id,gender,age,topic,sign,date,text,text_count
0,449628,male,34,indUnk,Aries,"05,June,2003",urlLink A Day in the Country 20...,4221
1,449628,male,34,indUnk,Aries,"20,February,2003",urlLink DE Japan : Resources : Career...,4221
2,734562,female,24,Arts,Libra,"03,August,2004",You ain't fat! You ain't nothin'! ...,2301
3,734562,female,24,Arts,Libra,"03,August,2004",so no one was amused by the old JLS...,2301
4,589736,male,35,Technology,Aries,"05,August,2004",i'm sorry that i didn't let the gro...,2294


### So I need to create a prompt that will connect the "text" column with the "id" column in the train set and then ask about a random text in the test set and have the model try to guess from which id it came from. 

In [5]:
# Get pairs or ids and texts to use them as variables in the prompt
train_pairs = train_set[['id', 'text']].to_dict(orient='records')


In [6]:
#get a list of author ids
author_ids = train_set['id'].unique()
len(author_ids)


10

In [10]:
# First select a random row from test_set
random_row = test_set.sample(n=1, random_state=42)  # random_state for reproducibility
random_text = random_row['text'].iloc[0]
random_id = random_row['id'].iloc[0]  # We'll keep this to check if the model's prediction is correct

In [12]:
# First version of the prompt
prompt = f"""
You are an authorship analysis expert. You will be given examples of two texts from each of 10 different authors.\
You should read the samples carefully and identify linguistic features that distinguish one author from the others.\
Then you will be given a new text, and your job is to identify which author it most likely belongs to.

Here are the authors and the texts:
<author_samples>
<author_id>
{train_pairs[0]['id']}
</author_id>
<texts>
{train_pairs[0]['text']}
{train_pairs[1]['text']}
</text>

<author_id>
{train_pairs[2]['id']}
</author_id>
<texts>
{train_pairs[2]['text']}
{train_pairs[3]['text']}
</texts>

<author_id>
{train_pairs[4]['id']}
</author_id>
<texts>
{train_pairs[4]['text']}
{train_pairs[5]['text']}
</texts>

<author_id>
{train_pairs[6]['id']}
</author_id>
<texts>
{train_pairs[6]['text']}
{train_pairs[7]['text']}
</texts>

<author_id>
{train_pairs[8]['id']}
</author_id>
<texts>
{train_pairs[8]['text']}
{train_pairs[9]['text']}
</texts>

<author_id>
{train_pairs[10]['id']}
</author_id>
<texts>
{train_pairs[10]['text']}
{train_pairs[11]['text']}
</texts>

<author_id>
{train_pairs[12]['id']}
</author_id>
<texts>
{train_pairs[12]['text']}
{train_pairs[13]['text']}
</texts>

<author_id>
{train_pairs[14]['id']}
</author_id>
<texts>
{train_pairs[14]['text']}
{train_pairs[15]['text']}
</texts>

<author_id>
{train_pairs[16]['id']}
</author_id>
<texts>
{train_pairs[16]['text']}
{train_pairs[17]['text']}
</texts>

<author_id>
{train_pairs[18]['id']}
</author_id>
<texts>
{train_pairs[18]['text']}
{train_pairs[19]['text']}
</texts>

</author_samples>

Now you will be given a new text, and your job is to identify which author it most likely belongs to.

<new_text>
{random_text}
</new_text>

Present your answer in the following format:
<answer>
Author ID: [Your prediction]
</answer>
"""



list

In [23]:
#Improved version (with LLM help)

prompt = f"""
You are an expert in linguistic analysis and authorship attribution. Your task is to analyze authorship--i.e. idiolectal--markers that the \
texts reveal about their authors.


You will be provided with:
1. Training samples from 10 different authors (2 texts per author)
2. A new text whose author you need to identify

Guidelines for analysis:
- Pay attention to writing style elements
- Note especially the grammatical structure of the text and grammatical items, such as\
connectors, prepositions, verbs, phrases, idioms, collocations, etc.
- Consider both obvious and subtle patterns
- Look for consistent features across an author's texts
- Consider any unique or distinctive elements


Here are the training samples:

<author_samples>
<author_id>
{train_pairs[0]['id']}
</author_id>
<texts>
{train_pairs[0]['text']}
{train_pairs[1]['text']}
</text>

<author_id>
{train_pairs[2]['id']}
</author_id>
<texts>
{train_pairs[2]['text']}
{train_pairs[3]['text']}
</texts>

<author_id>
{train_pairs[4]['id']}
</author_id>
<texts>
{train_pairs[4]['text']}
{train_pairs[5]['text']}
</texts>

<author_id>
{train_pairs[6]['id']}
</author_id>
<texts>
{train_pairs[6]['text']}
{train_pairs[7]['text']}
</texts>

<author_id>
{train_pairs[8]['id']}
</author_id>
<texts>
{train_pairs[8]['text']}
{train_pairs[9]['text']}
</texts>

<author_id>
{train_pairs[10]['id']}
</author_id>
<texts>
{train_pairs[10]['text']}
{train_pairs[11]['text']}
</texts>

<author_id>
{train_pairs[12]['id']}
</author_id>
<texts>
{train_pairs[12]['text']}
{train_pairs[13]['text']}
</texts>

<author_id>
{train_pairs[14]['id']}
</author_id>
<texts>
{train_pairs[14]['text']}
{train_pairs[15]['text']}
</texts>

<author_id>
{train_pairs[16]['id']}
</author_id>
<texts>
{train_pairs[16]['text']}
{train_pairs[17]['text']}
</texts>

<author_id>
{train_pairs[18]['id']}
</author_id>
<texts>
{train_pairs[18]['text']}
{train_pairs[19]['text']}
</texts>
</author_samples>

Now, analyze this new text and identify which author it most likely belongs to.

<new_text>
{random_text}
</new_text>


As an answer, give only the predicted id, no other text.
"""

In [24]:
first_prediction = get_completion(prompt)
print(first_prediction)

303162


In [21]:
# Check if the model's predicted author id exsists
author_ids = train_set['id'].unique()

# Split the string by "Author ID: " and take the second part
id_part = first_prediction.split("Author ID: ")[1]
# Split by newline and take the first part to get just the number
predicted_id = int(id_part.split('\n')[0].strip())

if predicted_id in author_ids:
    print(f"The predicted author id {predicted_id} exists in the training set.")
else:
    print(f"The predicted author id {predicted_id} does not exist in the training set.")


The predicted author id 303162 exists in the training set.


In [22]:
# Assess the accuracy of the model's prediction

if predicted_id == random_id:
    print("The model's prediction is correct.")
else:
    print("The model's prediction is incorrect.")


The model's prediction is incorrect.


In [25]:
# Create an experiment with 100 predictions

# Initialize an empty list to store all predictions
all_predictions = []

# Loop through 100 iterations
for i in range(100):
    try:
        # Select a random row from test_set
        random_row = test_set.sample(n=1, random_state=i)  # Using i as random_state for reproducibility
        random_text = random_row['text'].iloc[0]
        random_id = random_row['id'].iloc[0]
        
        # Get the prediction from the model
        prediction = get_completion(prompt)
        
        # Check if prediction is empty
        if not prediction:
            print(f"Iteration {i}: Empty prediction received")
            continue
            
        # Try to convert the prediction to integer
        try:
            predicted_id = int(prediction)
        except ValueError:
            print(f"Iteration {i}: Invalid prediction format: {prediction}")
            continue
            
        # Store the results
        all_predictions.append({
            'actual_id': random_id,
            'predicted_id': predicted_id,
            'is_correct': random_id == predicted_id
        })
        
        # Print progress every 10 iterations
        if (i + 1) % 10 == 0:
            print(f"Completed {i + 1} iterations")
            
    except Exception as e:
        print(f"Iteration {i}: Error occurred - {str(e)}")
        continue



Completed 10 iterations
Completed 20 iterations
Completed 30 iterations
Completed 40 iterations
Completed 50 iterations
Completed 60 iterations
Completed 70 iterations
Completed 80 iterations
Completed 90 iterations
Completed 100 iterations


In [28]:
# Check if we have any predictions
if not all_predictions:
    print("No valid predictions were collected")
else:
    # Convert the list of predictions to a DataFrame
    predictions_df = pd.DataFrame(all_predictions)
    
    # Calculate and print overall accuracy
    accuracy = predictions_df['is_correct'].mean()
    print(f"\nOverall accuracy: {accuracy:.2%}")
    print(f"Total predictions: {len(predictions_df)}")
    print(f"Correct predictions: {predictions_df['is_correct'].sum()}")


Overall accuracy: 8.00%
Total predictions: 100
Correct predictions: 8


In [31]:
# Determine what would be the accuracy by random guessing
accuracy_random = 1 / len(author_ids) * 100
print(f"Random guessing accuracy: {accuracy_random:.2%}")

# Determine what would be the accuracy by guessing the most frequent author
most_frequent_author = train_set['id'].mode()[0]
accuracy_most_frequent = train_set[train_set['id'] == most_frequent_author]['id'].count() / len(author_ids) * 100
print(f"Accuracy guessing the most frequent author: {accuracy_most_frequent:.2%}")

Random guessing accuracy: 10.00%
Accuracy guessing the most frequent author: 303162
