# Welcome!

The goal of this week is to develop and test a narrative archetype annotation system.

The dataset you will be using is called "WikiData_Summaries_Labeled_Sample.csv".

It contains 50 summaries of famous literary works drawn from Wikipedia. It also contains annotations provided by GPT-4 for:

- protagonist = main character's name
- pro.type = hero/villain/victim
- antagonist = the negative "force" of the story (corruption, violence, plague...)
- topic = general thematic focus of the story
- moral = one sentence story moral
- moral.pos = three keywords that are the central morals of the story
- moral.neg = single negative moral (x is a bad behavior)
- archetype = story archetype (free form based on GPT's own knowledge)

You may take one of several directions for this week's report.

1. Compare numerous models of different sizes on a single labeling task. How often do they agree? When they disagree, what does this tell you? Do you see semantic biases in how they respond to the task or do different models give very different answers?

2. Invent your own archetype task. Define a high-level dimension of narrative that you wish to classify, using either an open-ended approach or a preset taxonomy. In your report, motivate why you chose this design and what you hoped to capture (why is it valuable?). Discuss how a given model performs on the task. You do not need to report an overall accuracy score, but you should discuss salient examples, how they meet or fail to meet your expectations, and why. Note: do not report that your model totally failed; in that case, choose a different model or a different task. You can get inspiration from TV Tropes: https://tvtropes.org/pmwiki/pmwiki.php/Main/NarrativeTropes

Note that for this week's report you will need to supply missing analytical steps, whether it is code to compare model outputs or code to design a prompt workflow and evaluation.

# 1. Install Prerequisite Libraries
The code below depends on three Python "libraries" (software collections): matplotlib, pandas, and requests. Run the cell below once to install them.

In [None]:
!pip install matplotlib pandas requests

# 2. Establish Your Working Directory

For our projects this semester we will upload a .csv file that has a "text" column. This will be our input to the language model.

First establish your working directory. Create a folder called "Jupyter" and put it in your Documents folder. Then run this cell.

In [None]:
# Import the libraries we need
from pathlib import Path  # This helps us work with file paths
import os                # This lets us change directories

def use_jupyter_folder():
    # Get the path to the Jupyter folder
    jupyter_folder = Path.home() / 'Documents' / 'Jupyter'
    
    # Try to change to that directory
    if jupyter_folder.exists():
        os.chdir(jupyter_folder)
        print(f"✅ Now using your Jupyter folder!")
        print(f"Current working directory: {Path.cwd()}")
    else:
        print("❌ Couldn't find the Jupyter folder in Documents.")
        print("Please make sure you've created it first.")

# Run this to switch to the Jupyter folder
use_jupyter_folder()

# 3. Upload Your Data

Next upload a .csv file. Paste the filename where indicated. This cell will output the column names.

In [None]:
########## CONFIGURATION VARIABLES ###########
FILENAME = "NarraDetect_Scalar.csv"  # Your CSV filename here, e.g. "WikiData_Summaries_Labeled_Sample.csv"

## Define Function
import pandas as pd

def load_csv(filename):
   """Load CSV file and display info"""
   try:
       df = pd.read_csv(filename)
       print(f"✅ Successfully loaded {filename}")
       print(f"Shape: {df.shape[0]} rows, {df.shape[1]} columns")
       print("\nColumns in this dataset:")
       for col in df.columns:
           print(f"- {col}")
       return df
   except FileNotFoundError:
       print(f"❌ Could not find {filename} in {Path.cwd()}")
   except Exception as e:
       print(f"❌ Error loading file: {str(e)}")

## Run function
df = load_csv(FILENAME)


# 4. Inspect Your Data

This cell will give you brief summary statistics on the input text column. This is the column you will use as part of your prompting.

In [None]:
########## CONFIGURATION VARIABLES ###########
TEXT_COLUMN = 'TEXT'    # Column containing text data
NUM_EXAMPLES = 2        # Number of example texts to display

########## FUNCTION DEFINITION ###########
def text_stats(df, text_column=TEXT_COLUMN, num_examples=NUM_EXAMPLES):
   """Display text statistics and examples"""
   # Calculate word counts
   word_counts = df[text_column].str.split().str.len()
   total_words = word_counts.sum()
   
   print(f"📊 Dataset Overview:")
   print(f"Total number of texts: {len(df)}")
   
   print(f"\n📝 Text Length Statistics:")
   print(f"Shortest text: {word_counts.min()} words")
   print(f"Longest text: {word_counts.max()} words")
   print(f"Average length: {word_counts.mean():.1f} words")
   print(f"Median length: {word_counts.median():.1f} words")
   print(f"Total words in dataset: {total_words:,} words")
   
   print(f"\n📚 Here are {num_examples} example texts from your data:")
   # Never try to show more examples than the dataset contains
   for i in range(min(num_examples, len(df))):
       idx = df.index[i]
       text = df.loc[idx, text_column]
       length = len(text.split())
       print(f"Example {i+1}:")
       print(f"Length: {length} words")
       print(f"Text: {text}")

# Calculate statistics and show examples
text_stats(df)

# 5. Define your Ollama model

This cell sets the name of the model you want to use. To download a model for the first time, uncomment the "ollama pull" line; each model only needs to be downloaded once.
Rerun the cell every time you want to switch to a different model.

In [1]:
model = "llama3:8b"  # Change this to your model name, e.g. "mistral", "codellama", etc.
#!ollama pull {model}
print("Done!")

Done!


# 6. Prompt Testing

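In this cell you define your parameters: the model, the column that contains the text passages, your prompt, and whether you want structured output. When STRUCTURED is True, the model is asked to return a JSON object whose "label" field must be one of the values in LABELS.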

In [None]:
##### INPUT YOUR PARAMETERS HERE #####
MODEL_NAME = model 
COLUMN_NAME = "TEXT"   # Change dataframe column name here
PROMPT_TEMPLATE = "Is this passage from a story? Answer 1 for yes or 0 for no.\n\nPassage: {text}"  # Change your prompt here
STRUCTURED = False
LABELS = ["1", "0"]
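As one possible starting point for a preset-taxonomy task, the sketch below shows how these parameters might look for the hero/villain/victim labels described in the dataset annotations. The names EXAMPLE_LABELS and EXAMPLE_PROMPT_TEMPLATE, the prompt wording, and the sample summary are illustrative assumptions, not part of the assignment.

```python
# Hypothetical settings for a preset-taxonomy task (assumed for illustration)
EXAMPLE_LABELS = ["hero", "villain", "victim"]
EXAMPLE_PROMPT_TEMPLATE = (
    "Classify the protagonist of the story summarized below as one of: "
    + ", ".join(EXAMPLE_LABELS)
    + ". Answer with the label only.\n\nSummary: {text}"
)

# Fill the template with a made-up summary to preview the final prompt
example_prompt = EXAMPLE_PROMPT_TEMPLATE.format(
    text="A knight rescues a village from a dragon."
)
print(example_prompt)
```

Previewing the filled-in prompt before querying a model makes it easier to spot a missing separator between the instruction and the passage.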

# 7. Test a random passage

The cell chooses a random passage from the .csv and outputs the answer. You can run multiple times to keep testing answers on random passages.

In [None]:
import random
import requests
import ast

def query_ollama(text):
    """Query local ollama model with text"""
    url = "http://localhost:11434/api/generate"
    prompt = PROMPT_TEMPLATE.format(text=text)
    
    data = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,
        "format": {
            "type": "object",
            "properties": {
                "label": {
                    "type": "string",
                    "enum" : LABELS
                },
            },
            "required": [
                    "label",
                ]
        } if STRUCTURED else ''
    }
    
    try:
        # Check if model exists
        model_url = "http://localhost:11434/api/tags"
        models = requests.get(model_url).json()
        available_models = [model['name'] for model in models['models']]
        
        if MODEL_NAME not in available_models:
            print(f"❌ Model '{MODEL_NAME}' not found.")
            print(f"Available models: {', '.join(available_models)}")
            print(f"\nTo install {MODEL_NAME}, run this in terminal:")
            print(f"ollama pull {MODEL_NAME}")
            return None

        response = requests.post(url, json=data)
        if response.status_code == 404:
            print("❌ Ollama service not running.")
            print("Start ollama by running 'ollama serve' in terminal")
            return None

        result = response.json()
        if STRUCTURED:
            return ast.literal_eval(result['response'])['label']
        return result['response']

    except requests.exceptions.ConnectionError:
        print("❌ Cannot connect to Ollama")
        print("1. Check if Ollama is installed") 
        print("2. Start Ollama by running 'ollama serve' in terminal")
        return None
    except Exception as e:
        print(f"❌ Error: {str(e)}")
        return None

def analyze_random_text(df):
  """Analyze a random text from dataset"""
  random_idx = random.randint(0, len(df)-1)
  text = df.iloc[random_idx][COLUMN_NAME]
  print("\n📖 SAMPLE PASSAGE:")
  print(text)
  print("\n🤖 MODEL RESPONSE:")
  return query_ollama(text)

# Run
result = analyze_random_text(df)
if result:
    print(result)

# 8. Run your prompt on all of your data

In this cell you will run your prompt on the full data. The outputs will be stored as a new column named after the model you are using. In the next cell you can view those results. The cell will output "Completed" when complete.

Note: this cell takes its parameters from Section 6 (Prompt Testing). If you want to change them, go back up and rerun that cell.

In [None]:
import ast       # Needed to parse structured responses below
import requests

def query_ollama(text):
    """Query local ollama model with text"""
    url = "http://localhost:11434/api/generate"
    prompt = PROMPT_TEMPLATE.format(text=text)
    
    data = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "stream": False,
        "format": {
            "type": "object",
            "properties": {
                "label": {
                    "type": "string",
                    "enum" : LABELS
                },
            },
            "required": [
                    "label",
            ]
        } if STRUCTURED else ''
    }
    
    try:
        # Check if model exists
        model_url = "http://localhost:11434/api/tags"
        models = requests.get(model_url).json()
        available_models = [model['name'] for model in models['models']]
        
        if MODEL_NAME not in available_models:
            print(f"❌ Model '{MODEL_NAME}' not found.")
            return None

        response = requests.post(url, json=data)
        if response.status_code == 404:
            print("❌ Ollama service not running.")
            return None

        result = response.json()
        if STRUCTURED:
            return ast.literal_eval(result['response'])['label']
        return result['response']
    except requests.exceptions.ConnectionError:
        print("❌ Cannot connect to Ollama")
        return None
    except Exception as e:
        print(f"❌ Error: {str(e)}")
        return None

def analyze_all_texts(df):
    """Analyze all texts in the dataframe"""
    # Create new column for responses using model name
    df[MODEL_NAME] = df[COLUMN_NAME].apply(query_ollama)
    return df

# Run analysis on all rows
sample_df = analyze_all_texts(df)
print("Completed!")

# 9. Inspect your outputs

You can quickly scan your results by printing out the first N examples. Change the integer passed to head() to print more or fewer rows. Each row shows the passage and the model's output.

In [None]:
print(sample_df[[COLUMN_NAME, MODEL_NAME]].head(3))

Print a single passage by row number.

In [None]:
# Display a specific row (change row_number to view different rows)
row_number = 2  # Change this number to view different rows
print(f"\nDetailed view of row {row_number}:")
print(f"\nTEXT:\n{sample_df[COLUMN_NAME].iloc[row_number]}")
print(f"\n{MODEL_NAME} response:\n{sample_df[MODEL_NAME].iloc[row_number]}")

# 10. Clean your outputs to standardize them
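Unstructured responses often wrap the label in extra words or whitespace. The following is a minimal sketch of one way to standardize them, assuming a fixed list of allowed labels; the function name clean_label and the toy responses are hypothetical, and ambiguous responses are deliberately mapped to None rather than guessed.

```python
import re

def clean_label(raw, labels):
    """Map a free-form model response to one of the allowed labels, or None."""
    if not isinstance(raw, str):
        return None  # failed queries are stored as None/NaN
    text = raw.strip().lower()
    # Collect every allowed label that appears as a whole word in the response
    found = [
        label for label in labels
        if re.search(rf"(?<!\w){re.escape(label.lower())}(?!\w)", text)
    ]
    # Keep the response only when exactly one label is present
    return found[0] if len(found) == 1 else None

# Toy responses (made up for illustration)
raw_outputs = ["1", " Yes, the answer is 1.", "0\n", "It could be 1 or 0", None]
cleaned = [clean_label(r, ["1", "0"]) for r in raw_outputs]
print(cleaned)  # ['1', '1', '0', None, None]
```

On your own data, the same function could be applied with sample_df[MODEL_NAME].apply(lambda r: clean_label(r, LABELS)), storing the result in a new column so the raw responses are preserved.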

# 11. Plot the distribution of your labels using a bar graph

Do this if you have structured, limited numbers of classes.
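A bar graph of label counts might be drawn along these lines. This is only a sketch using matplotlib; the function name plot_label_counts and the toy labels are assumptions made for illustration.

```python
from collections import Counter
import matplotlib.pyplot as plt

def plot_label_counts(labels, title="Label distribution"):
    """Draw a bar chart of label frequencies and return the counts and figure."""
    # Skip missing values (None) left over from failed or uncleaned responses
    counts = Counter(label for label in labels if label is not None)
    names = sorted(counts)
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.bar(names, [counts[name] for name in names])
    ax.set_xlabel("Label")
    ax.set_ylabel("Number of texts")
    ax.set_title(title)
    fig.tight_layout()
    return counts, fig

# Toy labels (made up for illustration)
toy_counts, toy_fig = plot_label_counts(
    ["hero", "villain", "hero", "victim", None, "hero"]
)
print(dict(toy_counts))
```

To plot real outputs, you could pass sample_df[MODEL_NAME].dropna() instead of the toy list, ideally after the labels have been cleaned.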


# 12. Compare your outputs to GPT outputs

Do this if you are comparing multiple models. 
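When several models have labeled the same texts, a simple first measure is how often each pair of models gives the same label. The sketch below assumes cleaned labels stored as equally long lists; the model names and labels are made up for illustration.

```python
from itertools import combinations

def agreement_rate(labels_a, labels_b):
    """Share of items on which two equally long label lists give the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("Both label lists must have the same length")
    if not labels_a:
        raise ValueError("Label lists must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# Toy outputs keyed by made-up model names
toy_outputs = {
    "model_small": ["hero", "villain", "hero", "victim"],
    "model_medium": ["hero", "villain", "victim", "victim"],
    "model_large": ["hero", "hero", "victim", "victim"],
}

# Compute the agreement rate for every pair of models
pairwise = {
    (a, b): agreement_rate(toy_outputs[a], toy_outputs[b])
    for a, b in combinations(toy_outputs, 2)
}
for (a, b), rate in pairwise.items():
    print(f"{a} vs {b}: {rate:.2f}")
```

Raw agreement does not correct for agreement expected by chance, so it is best read alongside the specific texts on which the models disagree.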

Options: 

You could measure F1, precision and recall using the GPT answers as the reference. 

If you are using more open-ended responses, you could ask Claude/GPT to write code to analyze the semantic similarity of your outputs and compare them to GPT's outputs. Which models are closest in similarity to GPT's responses? Are there any outlier models?
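For the first option, precision, recall, and F1 can be computed per label without extra libraries. The following is a minimal sketch that treats the GPT annotations as the reference; the function name precision_recall_f1 and the toy labels are assumptions for illustration.

```python
def precision_recall_f1(reference, predicted, positive_label):
    """Precision, recall and F1 for one label, treating `reference` as ground truth."""
    pairs = list(zip(reference, predicted))
    # True positives: both the reference and the model chose the label
    tp = sum(r == positive_label and p == positive_label for r, p in pairs)
    # False positives: the model chose the label but the reference did not
    fp = sum(r != positive_label and p == positive_label for r, p in pairs)
    # False negatives: the reference chose the label but the model did not
    fn = sum(r == positive_label and p != positive_label for r, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Toy data (made up for illustration)
gpt_labels = ["hero", "hero", "villain", "victim", "hero"]
model_labels = ["hero", "villain", "villain", "victim", "victim"]

for label in ["hero", "villain", "victim"]:
    p, r, f = precision_recall_f1(gpt_labels, model_labels, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

On the real data, the reference could be the GPT-4 annotation column (for example pro.type) and the predictions your cleaned model column; check the exact column names in your CSV first.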

# 13. Output your .csv to your Jupyter directory

You can do this to inspect your data further, review outputs, or upload your data to Claude / GPT for visualization or further analysis.

In [None]:
# Save the DataFrame to a CSV file
output_file = "sample_df_output.csv"  # Specify the output file name
sample_df.to_csv(output_file, index=False)  # Set index=False to exclude the DataFrame index

print(f"DataFrame saved as {output_file}")