<a href="https://colab.research.google.com/github/mahb97/joyce-dubliners-similes-analysis/blob/main/01_data_processing_linguistic_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Computational Analysis of Simile Structures in Joyce's Dubliners and the BNC
## Notebook 1: Data Processing and preparation for Linguistic Analysis

**As part of the MA Dissertion in Digital Humanities at UCL, 2025**

From the Research Dialogue *Language Games* in HOLO 3, text by Nora N Khan & Peli Grietzer: *"To spend time obsessing over language, whether written, inscribed, coded, encoded, or spoken, is to feel acutely aware of its limits: the places that language fails, the territories of experiences it outlines. The way acceptable language shifts with one’s time. To spend time in computational spaces is to run into constant frustration with languages elisions, its elusiveness,  its – for lack of better words – ineffable qualities, its indeterminacy. That pesky language is ever expressed in relation, tied to experience becoming real, fixed for only a moment when printed. As authors have written here, computation and code have fairly low tolerance for elisions and variance in meaning, for what can be changed on a dime. Language always gestures to places beyond it, of spiritual, transcendent, mystic experiences, even as it makes the world. (...) Through a range of gestures, coded movements, still we manage to communicate outside of dominant symbolic language systems. Perhaps it must be put into words, to acknowledge that we routinely communicate through sounds, drawing, touch, embodied expression; ideas are communicated through architecture and the design of public and private spaces, driven by unnamed agendas. Our ideas also circulate through digital artefacts, through ghosts and ether, through performance and gestures and **half-smiles**. Language has to work for us, tasked with helping our species survive, perpetuate itself."*

This notebook implements the data processing pipeline for computational analysis of simile structures across four datasets: manual annotations, computational extractions, and BNC baseline corpus. The analysis extends the theoretical framework of Leech & Short (1981) on Quasi-Similes and frames these as discoverable in Joyce when reading is carried out with sympathy, as proposed by Tanja Vesala-Varttala (1999).

NLP is "focused on the desig and analysis of computational algorithms and representations for processing natural language" whereby its goal "is to provide new computational capabilities around human language: for example, extracting informatio from texts, translatig between languages, answering questions, holding a conversation, taking instructions and so on" (Introduction to Natural Language Processing, page 1-2).

Zipf's law is important here: "there will be a few words that are very frequent, and a long trail of words that are rare. A consequence is that Natural Language Processing algorithms must be especially robust to observations that do not occur in the training data" (ibid).




(insert tanja quote)

Insert Citations - Tanja, Leech and Short


## Colab Setup and GitHub Integration

## Upload Instructions

**To complete the setup:**

1. **Upload your 2 CSV files** to this Colab environment or your GitHub repository:
   - `All Similes  Dubliners cont.csv` (your manual annotations)
   - `concordance from BNC.csv` (BNC baseline)
   
2. Run the cells below to verify the files are loaded correctly.

3. **Run the full processing pipeline** once both files are available

The notebook will automatically:
- Download Dubliners from Project Gutenberg
- Run computational simile extraction
- Export processed datasets for Notebook 2

In [2]:
# Colab file upload setup
try:
    from google.colab import files
    import os

    print("Running in Google Colab")
    print("Current working directory:", os.getcwd())

    # Checks if required files already exist
    required_files = [
        'All Similes - Dubliners cont.csv', # Manual annotations
        'concordance from BNC.csv'                 # BNC baseline data
    ]

    missing_files = [f for f in required_files if not os.path.exists(f)]

    if missing_files:
        print("\nRequired data files not found. Please upload the following files:")
        for file in missing_files:
            print(f"  - {file}")
        print("\nRun the next cell to upload your files.")
    else:
        print("\nAll required files found:")
        for file in required_files:
            print(f"  FOUND: {file}")

except ImportError:
    print("Not running in Colab")
    print("Please ensure your CSV files are in the current directory")

    import os
    print(f"Current directory: {os.getcwd()}")
    print("Files in directory:")
    for file in os.listdir('.'):
        if file.endswith('.csv'):
            print(f"  {file}")

Running in Google Colab
Current working directory: /content

All required files found:
  FOUND: All Similes - Dubliners cont.csv
  FOUND: concordance from BNC.csv


# File upload cell - only run the below two cells if files are missing

In [None]:
# File upload cell - run this if files are missing
try:
    from google.colab import files
    import os

    print("Click 'Choose Files' to upload your CSV files")
    print("Upload both: manual annotations + BNC concordance")

    uploaded = files.upload()

    print("\nFiles uploaded successfully:")
    for filename in uploaded.keys():
        print(f"  {filename} ({len(uploaded[filename])} bytes)")

    print("\nRerun the verification cell below to check files")

except ImportError:
    print("Not in Colab - please place CSV files in current directory")

Click 'Choose Files' to upload your CSV files
Upload both: manual annotations + BNC concordance



Files uploaded successfully:

Rerun the verification cell below to check files


In [None]:
# Verify input data files are available - run this if the manual upload cell was needed
import os

required_input_files = [
    'All Similes - Dubliners cont copy.csv',  # Manual annotations
    'concordance from BNC.csv'                     # BNC baseline data
]

missing_files = []
for file in required_input_files:
    if not os.path.exists(file):
        missing_files.append(file)

if missing_files:
    print("WARNING: Missing required input data files:")
    for file in missing_files:
        print(f"  MISSING: {file}")
    print("\nPlease use the file upload cell above to upload these files:")
    print("  1. Your manual annotations CSV")
    print("  2. Your BNC concordance CSV")
    print("\nThe third dataset (computational extractions) will be generated by this notebook.")
else:
    print("All required input files found")
    print("  FOUND: Manual annotations ready")
    print("  FOUND: BNC baseline data ready")
    print("  GENERATE: Computational extractions will be created")

  MISSING: All Similes - Dubliners cont(Sheet1).csv

Please use the file upload cell above to upload these files:
  1. Your manual annotations CSV
  2. Your BNC concordance CSV

The third dataset (computational extractions) will be generated by this notebook.


# Installing required packages


In [3]:
# Installs required packages
!pip install spacy textblob scikit-learn -q
!python -m spacy download en_core_web_lg -q

# Verifies the file structure
import os
print("Current working directory:", os.getcwd())
print("\nProject files:")
for file in os.listdir('.'):
    if file.endswith(('.csv', '.py', '.ipynb')):
        print(f"  ✓ {file}")

[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m400.7/400.7 MB[0m [31m2.8 MB/s[0m eta [36m0:00:00[0m
[?25h[38;5;2m✔ Download and installation successful[0m
You can now load the package via spacy.load('en_core_web_lg')
[38;5;3m⚠ Restart to reload dependencies[0m
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.
Current working directory: /content

Project files:
  ✓ concordance from BNC.csv
  ✓ All Similes - Dubliners cont.csv



# 1. Introduction and Research Objectives

1.1 Research Questions

How effectively can computational methods replicate manual expert identification of literary similes?What linguistic innovations distinguish Joycean similes from standard English usage patterns?

How do different extraction approaches (rule-based vs. pattern recognition) perform against ground truth annotations?
# 1.2 Theoretical Framework

The analysis employs a novel categorical framework

distinguishing:Standard Similes: Conventional comparative

*   constructionsJoycean Quasi-Similes: Epistemic and perception-based
*   structuresJoycean Silent Similes: Implicit comparisons through
*   comparisonsJoycean Framed Similes: Complex nested comparative
*  punctuationJoycean Quasi-Fuzzy: Approximate and hedge-based comparisons



Data Collection and Preprocessing
# 1 Corpus Selection
The study employs four distinct datasets providing comprehensive methodological coverage:

Manual Expert Annotations (Ground Truth): Close reading identification of 194 similes
Rule-Based Domain-Informed Extraction: Restrictive algorithmic targeting of manual findings
NLP Pattern Recognition: Less-restrictive computational extraction from Project Gutenberg text
British National Corpus Baseline: Standard English reference corpus (200 instances)

# 2.2 Methodological Approach
Following established corpus linguistic principles, the analysis implements both quantitative and qualitative assessment methods to ensure robust validation of computational approaches against human expert annotation.

# Finding similes in Dubliners : basic pattern approach

# 4. Comparative Methodology: NLP Pattern Recognition
# 4.1 Less-Restrictive Approach
To establish methodological comparison, this extraction pipeline implements general natural language processing patterns targeting all potential simile constructions without domain-specific constraints.

# 4.2 Linguistic Feature Analysis
This approach incorporates comprehensive linguistic analysis including:

Lemmatization and POS tagging using spaCy
Sentiment analysis via TextBlob
Topic modeling using Latent Dirichlet Allocation
Pre/post-comparator token analysis for structural assessment

# 4.3 Research Significance
The comparison between restrictive domain-informed and general pattern recognition approaches provides insight into the specificity requirements for literary computational analysis.

In [4]:
# =============================================================================
# LESS RESTRICTIVE NLP SIMILE EXTRACTION
# Target: Find all instances of 'like', 'as if', and 'as...as' in Dubliners
# Purpose: Generate a dataset for comparison with the rule-based extraction
# =============================================================================

import spacy
import pandas as pd
import requests
import re
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import warnings
warnings.filterwarnings('ignore')

print("LESS RESTRICTIVE NLP SIMILE EXTRACTION")
print("Targeting all 'like', 'as if', 'as...as', and other potential comparative instances.")
print("Includes basic linguistic analysis (lemmatization, POS, sentiment, topic)")
print("=" * 65)

# Initialize spaCy
try:
    nlp = spacy.load("en_core_web_sm")
    print("spaCy natural language processing pipeline loaded successfully")
except OSError:
    print("Warning: spaCy English model not found. Install with: python -m spacy download en_core_web_sm")
    nlp = None


def load_dubliners_text():
    """Load Dubliners text from Project Gutenberg."""
    url = "https://www.gutenberg.org/files/2814/2814-0.txt"
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        text = response.text

        # Clean metadata
        start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
        end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

        if start_marker in text:
            text = text.split(start_marker)[1]
        if end_marker in text:
            text = text.split(end_marker)[0]

        print(f"Downloaded {len(text):,} characters from Project Gutenberg")
        return text
    except Exception as e:
        print(f"Error loading text: {e}")
        return None

def extract_similes_nlp_basic(text):
    """
    Extract similes using basic NLP patterns ('like', 'as if', 'as...as', etc.).
    This version is intentionally less restrictive to find a broader set of potential similes.
    Performs lemmatization, POS tagging, and sentiment analysis.
    """
    if nlp is None:
        print("spaCy not loaded. Cannot perform detailed NLP analysis.")
        # Fallback to regex-based sentence splitting if spaCy is not available
        sentences = [s.strip() for s in re.split(r'(?<!Mr)(?<!Mrs)(?<!Dr)[.!?]+', text) if len(s.strip()) > 10]
    else:
        # Use spaCy's sentence segmentation
        doc = nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]


    basic_similes = []
    simile_id = 1

    print("Extracting similes with less restrictive NLP patterns...")

    # Definitions of the patterns that are being searched for
    inclusive_patterns = [
        r'\b(like)\b',         # 'like' as a potential comparator
        r'\b(as if)\b',        # 'as if'
        r'\b(as .*? as)\b',    # 'as ... as' (non-greedy)
        r'\b(than)\b',         # 'than' for comparison
        r'\b(similar to)\b',   # 'similar to'
        r'\b(resembled?)\b',   # 'resembled' or 'resembles'
        r'\b(seem|seems|seemed)\b', # Seem/appears for quasi-similes
        r'\b(appear|appears|appeared)\b',
        r'\b(such as)\b'       # 'such as'
    ]

    # Combine patterns into a single regex for efficiency
    valid_patterns = [p for p in inclusive_patterns if p.strip()]
    if not valid_patterns:
        print("No valid patterns defined for extraction.")
        return []

    combined_pattern = '|'.join(valid_patterns)

    for sentence in sentences:
        sent_lower = sentence.lower()
        found_comparators = []

        # Find all matches for the combined pattern
        matches = list(re.finditer(combined_pattern, sent_lower))

        # Process matches to get unique comparators found in the sentence
        for match in matches:
            # The comparator is the full matched string.
            comparator = match.group(0).strip()
            # Refines comparator type if it's a grouped pattern
            if match.lastindex is not None:
                 # For patterns like (as .*? as), it captures the full match
                 if match.group(1) == 'as .*? as':
                      comparator = match.group(0).strip()
                 else:
                      comparator = match.group(match.lastindex).strip() # Gets the content of the last captured group


            if comparator:
                found_comparators.append(comparator)

        # If any comparator was found, add the sentence
        if found_comparators:
            # Join multiple comparators if present, or just take the first one found for simplicity
            # in the 'Comparator_Type' column, but note all in 'Additional_Notes'
            main_comparator = found_comparators[0]
            all_comparators_note = f"Comparators found: {', '.join(sorted(list(set(found_comparators))))}"

            # Determine a basic type - this is less critical for 'less restrictive'
            if 'like' in main_comparator:
                 simile_type = 'like_pattern_nlp'
            elif 'as if' in main_comparator:
                 simile_type = 'as_if_pattern_nlp'
            elif ' as ' in main_comparator and ' as' in main_comparator: # Heuristic for as...as
                 simile_type = 'as_as_pattern_nlp'
            elif 'than' in main_comparator:
                 simile_type = 'than_pattern_nlp'
            elif 'seem' in main_comparator or 'appear' in main_comparator:
                 simile_type = 'quasi_pattern_nlp'
            elif 'similar' in main_comparator or 'resembl' in main_comparator:
                 simile_type = 'resemblance_pattern_nlp'
            else:
                 simile_type = 'other_pattern_nlp'


            # Performs a basic linguistic analysis
            lemmatized = ""
            pos_tags = ""
            sentiment_polarity = 0.0
            sentiment_subjectivity = 0.0
            total_tokens = 0
            pre_tokens = 0
            post_tokens = 0
            pre_post_ratio = 0.0

            if nlp:
                doc_sent = nlp(sentence)
                lemmatized = ' '.join([token.lemma_.lower() for token in doc_sent if not token.is_space and not token.is_punct and not token.is_stop])
                pos_tags = '; '.join([token.pos_ for token in doc_sent if not token.is_space])
                total_tokens = len([token for token in doc_sent if not token.is_space and not token.is_punct])

                # Estimate pre/post tokens based on first comparator location found
                comparator_token_index = None
                # Find the first token that is part of any found comparator
                for i, token in enumerate(doc_sent):
                    # Ensure token text is not just punctuation if excluding punctuation
                    if not token.is_punct and any(re.search(r'\b' + re.escape(comp_part) + r'\b', token.text.lower()) for comp in found_comparators for comp_part in comp.split() if comp_part not in [':',';','—','...','…']): # Ensure punctuation comparators are not used here either
                         comparator_token_index = i
                         break


                if comparator_token_index is not None:
                    pre_tokens = len([token for i, token in enumerate(doc_sent) if i < comparator_token_index and not token.is_space and not token.is_punct])
                    post_tokens = len([token for i, token in enumerate(doc_sent) if i > comparator_token_index and not token.is_space and not token.is_punct])
                else:
                     # Fallback if comparator token not found precisely
                    pre_tokens = total_tokens // 2
                    post_tokens = total_tokens - pre_tokens


                pre_post_ratio = pre_tokens / (post_tokens if post_tokens > 0 else 1)


            # Sentiment analysis using TextBlob
            blob = TextBlob(sentence)
            sentiment_polarity = blob.sentiment.polarity
            sentiment_subjectivity = blob.sentiment.subjectivity


            basic_similes.append({
                'ID': f'NLP-{simile_id:04d}',
                'Story': 'Unknown', # Cannot reliably split stories without more rules
                'Sentence_Context': sentence,
                'Comparator_Type': main_comparator, # Use the first found comparator
                'Category_Framework': 'NLP_LessRestrictive', # Updated category for this extraction
                'Additional_Notes': f'Less restrictive NLP extraction - {simile_type}. {all_comparators_note}',
                'Lemmatized_Text': lemmatized,
                'POS_Tags': pos_tags,
                'Sentiment_Polarity': sentiment_polarity,
                'Sentiment_Subjectivity': sentiment_subjectivity,
                'Total_Tokens': total_tokens,
                'Pre_Comparator_Tokens': pre_tokens,
                'Post_Comparator_Tokens': post_tokens,
                'Pre_Post_Ratio': pre_post_ratio
            })
            simile_id += 1

    print(f"Found {len(basic_similes)} potential similes using less restrictive NLP patterns.")
    return basic_similes

def perform_topic_modeling_nlp(df, n_topics=5):
    """
    Perform topic modeling on the less restrictive NLP extracted similes.
    """
    print(f"\nPERFORMING TOPIC MODELING ({n_topics} topics) on less restrictive NLP similes")
    print("-" * 40)

    # Use Lemmatized_Text if available, otherwise Sentence_Context
    texts = df['Lemmatized_Text'].dropna().astype(str).tolist()
    if not texts:
         texts = df['Sentence_Context'].dropna().astype(str).tolist()
         print("Using Sentence_Context for topic modeling as Lemmatized_Text is empty.")

    if len(texts) < n_topics:
        print(f"Warning: Insufficient data ({len(texts)}) for {n_topics} topics. Reducing to {len(texts)}")
        n_topics = min(n_topics, len(texts))
        if n_topics == 0:
            df['Topic_Label'] = 'No Data for Topic Modeling'
            print("No data for topic modeling.")
            return df
        print(f"Reduced topics to {n_topics}")


    # TF-IDF vectorization
    print("Performing TF-IDF vectorization...")
    vectorizer = TfidfVectorizer(
        max_features=500, # Increased features for potentially larger dataset
        stop_words='english',
        lowercase=True,
        ngram_range=(1, 2), # Include bigrams for richer context
        min_df=3, # Adjust min_df based on expected dataset size
        max_df=0.9
    )

    try:
        tfidf_matrix = vectorizer.fit_transform(texts)
        print(f"TF-IDF matrix created: {tfidf_matrix.shape}")

        # Latent Dirichlet Allocation
        lda = LatentDirichletAllocation(
            n_components=n_topics,
            random_state=42,
            max_iter=100, # Increased iterations
            learning_method='batch'
        )

        lda.fit(tfidf_matrix)

        # Extracts topic labels
        feature_names = vectorizer.get_feature_names_out()
        topic_labels = []

        print("Identified topics:")
        for topic_idx in range(n_topics):
            top_words = [feature_names[i] for i in lda.components_[topic_idx].argsort()[-5:]] # More words per topic
            topic_label = f"NLP_Topic_{topic_idx}: {', '.join(reversed(top_words))}"
            topic_labels.append(topic_label)
            print(f"  {topic_label}")

        # Assigns topics to texts
        topic_probs = lda.transform(tfidf_matrix)
        dominant_topics = topic_probs.argmax(axis=1)

        # Adds topic information back to dataframe
        topic_column = ['Unknown'] * len(df)
        valid_idx = 0
        text_col = 'Lemmatized_Text' if 'Lemmatized_Text' in df.columns else 'Sentence_Context'

        for i, (_, row) in enumerate(df.iterrows()):
            if pd.notna(row[text_col]):
                topic_column[i] = topic_labels[dominant_topics[valid_idx]]
                valid_idx += 1

        df['Topic_Label'] = topic_column

        print("Topic modeling analysis completed successfully")

    except Exception as e:
        print(f"Topic modeling failed: {e}")
        df['Topic_Label'] = 'Topic_Analysis_Failed'

    return df


# --- Execution ---
print("Starting less restrictive NLP simile extraction...")

# Loads the full text
dubliners_text = load_dubliners_text()

if dubliners_text:
    # Extracts similes using basic NLP patterns
    basic_similes_list = extract_similes_nlp_basic(dubliners_text)

    if basic_similes_list:
        basic_similes_df = pd.DataFrame(basic_similes_list)

        # Performs topic modeling
        basic_similes_df = perform_topic_modeling_nlp(basic_similes_df, n_topics=10) # Increased topics

        # Adds Dataset_Source column
        basic_similes_df['Dataset_Source'] = 'NLP_LessRestrictive_Extraction' # Updated source label


        # Saves results
        filename = 'dubliners_nlp_less_restrictive_extraction.csv' # Updated filename
        basic_similes_df.to_csv(filename, index=False)

        print(f"\nLESS RESTRICTIVE NLP EXTRACTION COMPLETED")
        print(f"Total instances extracted: {len(basic_similes_df)}")
        print(f"Results saved to: {filename}")

        # Displays sample results
        print("\n=== SAMPLE RESULTS (LESS RESTRICTIVE NLP) ===")
        display(basic_similes_df.head())

        print("\nReady for comparison with the rule-based extraction and manual annotations.")

    else:
        print("\nNo similes extracted using less restrictive NLP patterns.")
else:
    print("\nFailed to load Dubliners text for less restrictive NLP extraction.")

print("\nLESS RESTRICTIVE NLP EXTRACTION PIPELINE FINISHED")
print(f"Check for the CSV file: {filename}")

LESS RESTRICTIVE NLP SIMILE EXTRACTION
Targeting all 'like', 'as if', 'as...as', and other potential comparative instances (excluding punctuation)
Includes basic linguistic analysis (lemmatization, POS, sentiment, topic)
spaCy natural language processing pipeline loaded successfully
Starting less restrictive NLP simile extraction...
Downloaded 377,717 characters from Project Gutenberg
Extracting similes with less restrictive NLP patterns...
Found 330 potential similes using less restrictive NLP patterns.

PERFORMING TOPIC MODELING (10 topics) on less restrictive NLP similes
----------------------------------------
Performing TF-IDF vectorization...
TF-IDF matrix created: (330, 259)
Identified topics:
  NLP_Topic_0: like, say, mr, gabriel, word
  NLP_Topic_1: world, house, hear, girl, like
  NLP_Topic_2: like, man, head, know, come
  NLP_Topic_3: speak, soon, begin, home, mrs
  NLP_Topic_4: mr, far, mr kernan, kernan, mean
  NLP_Topic_5: appear, eye, like, aunt, old
  NLP_Topic_6: good,

Unnamed: 0,ID,Story,Sentence_Context,Comparator_Type,Category_Framework,Additional_Notes,Lemmatized_Text,POS_Tags,Sentiment_Polarity,Sentiment_Subjectivity,Total_Tokens,Pre_Comparator_Tokens,Post_Comparator_Tokens,Pre_Post_Ratio,Topic_Label,Dataset_Source
0,NLP-0001,Unknown,"It had always\r\nsounded strangely in my ears,...",like,NLP_LessRestrictive,Less restrictive NLP extraction - like_pattern...,sound strangely ear like word gnomon euclid wo...,PRON; AUX; ADV; VERB; ADV; ADP; PRON; NOUN; PU...,-0.05,0.15,22,8,13,0.615385,"NLP_Topic_0: like, say, mr, gabriel, word",NLP_LessRestrictive_Extraction
1,NLP-0002,Unknown,But now it sounded to me like the\r\nname of s...,like,NLP_LessRestrictive,Less restrictive NLP extraction - like_pattern...,sound like maleficent sinful,CCONJ; ADV; PRON; VERB; ADP; PRON; ADP; DET; N...,0.0,0.0,15,6,8,0.75,"NLP_Topic_1: world, house, hear, girl, like",NLP_LessRestrictive_Extraction
2,NLP-0003,Unknown,While my aunt was ladling out my stirabout he ...,as if,NLP_LessRestrictive,Less restrictive NLP extraction - as_if_patter...,aunt ladle stirabout say return remark exactly,SCONJ; PRON; NOUN; AUX; VERB; ADP; PRON; NOUN;...,0.125,0.125,27,10,16,0.625,"NLP_Topic_5: appear, eye, like, aunt, old",NLP_LessRestrictive_Extraction
3,NLP-0004,Unknown,so I continued eating as if the\r\nnews had no...,as if,NLP_LessRestrictive,Less restrictive NLP extraction - as_if_patter...,continue eat news interest,ADV; PRON; VERB; VERB; SCONJ; SCONJ; DET; NOUN...,-0.125,0.5,12,4,7,0.571429,"NLP_Topic_7: say, say like, long, like, mr",NLP_LessRestrictive_Extraction
4,NLP-0005,Unknown,"“I wouldn’t like children of mine,” he said, “...",like,NLP_LessRestrictive,Less restrictive NLP extraction - like_pattern...,like child say man like mean mr cotter ask aunt,PUNCT; PRON; AUX; PART; VERB; NOUN; ADP; NOUN;...,-0.05625,0.44375,29,3,25,0.12,"NLP_Topic_4: mr, far, mr kernan, kernan, mean",NLP_LessRestrictive_Extraction



Ready for comparison with the rule-based extraction and manual annotations.

LESS RESTRICTIVE NLP EXTRACTION PIPELINE FINISHED
Check for the CSV file: dubliners_nlp_less_restrictive_extraction.csv


# 2 Computational Extraction Pipeline
# 2.1 Algorithm Development
The rule-based simile extraction algorithm specifically targets the 194 instances identified through manual reading, implementing:

Precision-focused pattern matching for 'like' constructions (91 instances)
Contextual analysis for 'as if' patterns (38 instances)
Conservative extraction of Joycean Silent similes (6 instances: colon, en-dash, ellipsis)
Semantic classification of resemblance and quasi-simile patterns

# 3.2 Validation Strategy
The extraction pipeline employs F1 score analysis to quantify agreement between computational and manual identification, providing measurable validation of algorithmic effectiveness.

# JOYCE RULE-BASED SIMILE EXTRACTION ALGORITHM

In [None]:
# =============================================================================
# JOYCE RULE - BASED SIMILE EXTRACTION ALGORITHM
# Target: Match manual reading findings (~194 similes)
# Key insight: Only extract what manual reading actually confirmed as similes
# =============================================================================

import spacy
import pandas as pd
import requests
import re

print("SIMILE EXTRACTION ALGORITHM")
print("Targeting manual reading findings: 194 total similes")
print("- like: 91 instances")
print("- as if: 38 instances")
print("- Joycean_Silent: only 6 instances (2 colon, 2 en-dash, 2 ellipsis)")
print("=" * 65)

try:
    nlp = spacy.load("en_core_web_sm")
except:
    nlp = None

def load_and_split_dubliners():
    """Load and split Dubliners text."""
    url = "https://www.gutenberg.org/files/2814/2814-0.txt"
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        text = response.text

        # Clean metadata
        start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
        end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

        if start_marker in text:
            text = text.split(start_marker)[1]
        if end_marker in text:
            text = text.split(end_marker)[0]

        return text
    except Exception as e:
        print(f"Error loading text: {e}")
        return None

def extract_like_similes(text):
    """
    Extract 'like' similes - should find ~91 instances to match manual data.
    Be more inclusive since these are confirmed similes in manual reading.
    """
    if nlp is None:
        sentences = [s.strip() for s in re.split(r'[.!?]+', text) if len(s.strip()) > 10]
    else:
        doc = nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]

    like_similes = []

    for sentence in sentences:
        if ' like ' in sentence.lower():
            # Include most 'like' instances since manual reading confirmed them as similes
            # Only exclude obvious non-similes
            sent_lower = sentence.lower()

            # Minimal exclusions - only clear non-similes
            exclude_patterns = [
                'would like to', 'i would like', 'you would like',
                'feel like going', 'look like you', 'seem like you'
            ]

            if not any(pattern in sent_lower for pattern in exclude_patterns):
                like_similes.append({
                    'text': sentence,
                    'type': 'like_simile',
                    'comparator': 'like',
                    'theoretical_category': 'Standard'
                })

    return like_similes

def extract_as_if_similes(text):
    """
    Extract 'as if' similes - should find ~38 instances to match manual data.
    Include both Standard and Joycean_Quasi based on context.
    """
    if nlp is None:
        sentences = [s.strip() for s in re.split(r'[.!?]+', text) if len(s.strip()) > 10]
    else:
        doc = nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]

    as_if_similes = []

    for sentence in sentences:
        if 'as if' in sentence.lower():
            sent_lower = sentence.lower()

            # Determine if Standard or Joycean_Quasi based on context
            quasi_indicators = [
                'continued', 'observation', 'returning to', 'to listen',
                'the news had not', 'under observation'
            ]

            if any(indicator in sent_lower for indicator in quasi_indicators):
                category = 'Joycean_Quasi'
            else:
                category = 'Standard'

            as_if_similes.append({
                'text': sentence,
                'type': 'as_if_simile',
                'comparator': 'as if',
                'theoretical_category': category
            })

    return as_if_similes

def extract_seemed_similes(text):
    """
    Extract 'seemed' similes - should find ~9 instances.
    These are typically Joycean_Quasi.
    """
    if nlp is None:
        sentences = [s.strip() for s in re.split(r'[.!?]+', text) if len(s.strip()) > 10]
    else:
        doc = nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]

    seemed_similes = []

    for sentence in sentences:
        sent_lower = sentence.lower()
        if 'seemed' in sent_lower or 'seem' in sent_lower:
            # Only count if it has comparative elements
            if any(word in sent_lower for word in ['like', 'as if', 'to be', 'that']):
                seemed_similes.append({
                    'text': sentence,
                    'type': 'seemed_simile',
                    'comparator': 'seemed',
                    'theoretical_category': 'Joycean_Quasi'
                })

    return seemed_similes

def extract_as_adj_as_similes(text):
    """
    Extract 'as...as' constructions - should find ~9-12 instances.
    Exclude pure measurements and quantities.
    """
    if nlp is None:
        sentences = [s.strip() for s in re.split(r'[.!?]+', text) if len(s.strip()) > 10]
    else:
        doc = nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]

    as_as_similes = []

    for sentence in sentences:
        # Find 'as [adjective] as' patterns
        as_adj_as_pattern = re.search(r'\bas\s+(\w+)\s+as\s+', sentence.lower())
        if as_adj_as_pattern:
            adj = as_adj_as_pattern.group(1)

            # Exclude temporal, quantitative, and causal uses
            exclude_words = [
                'long', 'soon', 'far', 'much', 'many', 'well', 'poor',
                'good', 'bad', 'big', 'small', 'old', 'young'
            ]

            # Include descriptive adjectives that create genuine comparisons
            if adj not in exclude_words:
                as_as_similes.append({
                    'text': sentence,
                    'type': 'as_adj_as',
                    'comparator': 'as ADJ as',
                    'theoretical_category': 'Standard'
                })

    return as_as_similes

def extract_joycean_silent_precise(text):
    """
    Extract ONLY the 6 Joycean_Silent similes found in manual reading.
    Be extremely conservative - target specific known patterns.
    """
    if nlp is None:
        sentences = [s.strip() for s in re.split(r'[.!?]+', text) if len(s.strip()) > 20]
    else:
        doc = nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 20]

    silent_similes = []

    # Known Silent simile patterns from manual reading
    known_patterns = [
        'no hope for him this time',
        'customs were strange',
        'certain ... something',
        'faint fragrance escaped',
        'not ungallant figure',
        'expression changed'
    ]

    for sentence in sentences:
        # Only extract if very similar to known examples
        sent_lower = sentence.lower()

        # Check for colon patterns
        if ':' in sentence:
            if any(pattern in sent_lower for pattern in known_patterns[:3]):
                silent_similes.append({
                    'text': sentence,
                    'type': 'silent_colon',
                    'comparator': 'colon',
                    'theoretical_category': 'Joycean_Silent'
                })

        # Check for en-dash patterns
        elif '—' in sentence or ' - ' in sentence:
            if any(pattern in sent_lower for pattern in known_patterns[1:4]):
                silent_similes.append({
                    'text': sentence,
                    'type': 'silent_dash',
                    'comparator': 'en dash',
                    'theoretical_category': 'Joycean_Silent'
                })

        # Check for ellipsis patterns
        elif '...' in sentence:
            if any(pattern in sent_lower for pattern in known_patterns[2:]):
                silent_similes.append({
                    'text': sentence,
                    'type': 'silent_ellipsis',
                    'comparator': 'ellipsis',
                    'theoretical_category': 'Joycean_Silent'
                })

    return silent_similes

def extract_other_patterns(text):
    """
    Extract remaining patterns from manual data:
    - like + like (2 instances)
    - resembl* (3 instances)
    - similar, somewhat, etc.
    """
    if nlp is None:
        sentences = [s.strip() for s in re.split(r'[.!?]+', text) if len(s.strip()) > 10]
    else:
        doc = nlp(text)
        sentences = [sent.text.strip() for sent in doc.sents if len(sent.text.strip()) > 10]

    other_similes = []

    for sentence in sentences:
        sent_lower = sentence.lower()

        # Doubled 'like' patterns
        if sent_lower.count(' like ') >= 2:
            other_similes.append({
                'text': sentence,
                'type': 'doubled_like',
                'comparator': 'like + like',
                'theoretical_category': 'Joycean_Framed'
            })

        # Resemblance patterns
        elif any(word in sent_lower for word in ['resembl', 'similar', 'resemble']):
            other_similes.append({
                'text': sentence,
                'type': 'resemblance',
                'comparator': 'resembl*',
                'theoretical_category': 'Joycean_Quasi_Fuzzy'
            })

        # Other rare patterns
        elif 'somewhat' in sent_lower:
            other_similes.append({
                'text': sentence,
                'type': 'somewhat',
                'comparator': 'somewhat',
                'theoretical_category': 'Joycean_Quasi_Fuzzy'
            })

        # Compound adjectives with -like
        elif re.search(r'\w+like\b', sent_lower):
            like_match = re.search(r'(\w+like)\b', sent_lower)
            if like_match:
                other_similes.append({
                    'text': sentence,
                    'type': 'compound_like',
                    'comparator': '(-)like',
                    'theoretical_category': 'Standard'
                })

    return other_similes

def extract_all_similes_rulebased(text):
    """
    Extract all similes using algorithm targeting manual findings.
    Expected total: ~194 similes (not 355).
    """

    print("Extracting similes with algorithm...")

    results = {
        'like_similes': extract_like_similes(text),
        'as_if_similes': extract_as_if_similes(text),
        'seemed_similes': extract_seemed_similes(text),
        'as_adj_as_similes': extract_as_adj_as_similes(text),
        'silent_similes': extract_joycean_silent_precise(text),
        'other_patterns': extract_other_patterns(text)
    }

    return results

def split_into_stories_fixed(full_text):
    """Split Dubliners into individual stories with proper breakdown."""
    # Clean metadata
    start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
    end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

    if start_marker in full_text:
        full_text = full_text.split(start_marker)[1]
    if end_marker in full_text:
        full_text = full_text.split(end_marker)[0]

    story_titles = [
        "THE SISTERS", "AN ENCOUNTER", "ARABY", "EVELINE",
        "AFTER THE RACE", "TWO GALLANTS", "THE BOARDING HOUSE",
        "A LITTLE CLOUD", "COUNTERPARTS", "CLAY", "A PAINFUL CASE",
        "IVY DAY IN THE COMMITTEE ROOM", "A MOTHER", "GRACE", "THE DEAD"
    ]

    stories = {}
    for i, title in enumerate(story_titles):
        # Find story start
        story_start = None
        patterns = [
            rf'\n\s*{re.escape(title)}\s*\n\n',
            rf'\n\s*{re.escape(title)}\s*\n'
        ]

        for pattern in patterns:
            match = re.search(pattern, full_text, re.MULTILINE)
            if match:
                story_start = match.end()
                break

        if story_start is None and title in full_text:
            pos = full_text.find(title)
            story_start = full_text.find('\n', pos) + 1

        if story_start is None:
            continue

        # Find story end
        story_end = len(full_text)
        for next_title in story_titles[i+1:]:
            if next_title in full_text:
                next_pos = full_text.find(next_title, story_start)
                if next_pos > story_start:
                    story_end = next_pos
                    break

        story_content = full_text[story_start:story_end].strip()
        if len(story_content) > 200:
            stories[title] = story_content
            print(f"Found {title}: {len(story_content):,} characters")

    return stories

def process_dubliners_rulebased():
    """
    Process Dubliners with rule-based extraction and story-by-story breakdown.
    """
    print("\nLOADING DUBLINERS TEXT")
    print("-" * 25)

    # Load full text
    url = "https://www.gutenberg.org/files/2814/2814-0.txt"
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        full_text = response.text
        print(f"Downloaded {len(full_text):,} characters from Project Gutenberg")
    except Exception as e:
        print(f"Error loading text: {e}")
        return None

    print("\nSPLITTING INTO STORIES")
    print("-" * 22)

    # Split into individual stories
    stories = split_into_stories_fixed(full_text)
    print(f"Successfully found {len(stories)} stories")

    if len(stories) == 0:
        print("No stories found")
        return None

    print("\nEXTRACTING SIMILES")
    print("-" * 47)

    # Process each story individually
    all_similes = []
    simile_id = 1

    for story_title, story_text in stories.items():
        print(f"\n--- Processing: {story_title} ---")

        # Extract similes from this story
        story_results = extract_all_similes_rulebased(story_text)

        # Count by category for this story
        story_category_counts = {}
        story_similes = []

        for category, similes in story_results.items():
            if len(similes) > 0:
                print(f"  {category}: {len(similes)} similes")

            for simile in similes:
                # Add story information
                simile_data = {
                    'ID': f'RULE-{simile_id:03d}', # Changed ID prefix
                    'Story': story_title,
                    'Page No.': 'Computed',
                    'Sentence Context': simile['text'],
                    'Comparator Type ': simile['comparator'],
                    'Category (Framwrok)': simile['theoretical_category'],
                    'Additional Notes': f'Rule-based extraction - {simile["type"]}', # Changed note
                    'CLAWS': '',
                    'Confidence_Score': 0.85,
                    'Extraction_Method': category
                }

                story_similes.append(simile_data)
                all_similes.append(simile_data)

                # Count categories
                cat = simile['theoretical_category']
                story_category_counts[cat] = story_category_counts.get(cat, 0) + 1

                simile_id += 1

        # Show story summary
        total_story_similes = len(story_similes)
        print(f"  Total similes found: {total_story_similes}")

        if story_category_counts:
            print("  Category breakdown:")
            for cat, count in sorted(story_category_counts.items()):
                print(f"    {cat}: {count}")

        # Show examples of novel categories if found
        for cat in ['Joycean_Silent', 'Joycean_Quasi', 'Joycean_Framed']:
            examples = [s for s in story_similes if s['Category (Framwrok)'] == cat]
            if examples:
                ex = examples[0]
                print(f"    {cat} example: {ex['Sentence Context'][:70]}...")

    print(f"\n=== COMPLETE RESULTS ===")
    print(f"Total similes extracted: {len(all_similes)}")
    print(f"Target from manual reading: 194")
    print(f"Difference: {len(all_similes) - 194}")

    if len(all_similes) == 0:
        print("No similes found")
        return pd.DataFrame()

    # Convert to DataFrame
    results_df = pd.DataFrame(all_similes)

    # Overall category breakdown
    category_counts = results_df['Category (Framwrok)'].value_counts()
    print(f"\n=== OVERALL CATEGORY BREAKDOWN ===")
    for category, count in sorted(category_counts.items()):
        percentage = (count / len(results_df)) * 100
        print(f"  {category}: {count} ({percentage:.1f}%)")

    # Compare with manual targets
    manual_targets = {
        'Standard': 93, 'Joycean_Quasi': 53, 'Joycean_Silent': 6,
        'Joycean_Framed': 18, 'Joycean_Quasi_Fuzzy': 13
    }

    print(f"\n=== COMPARISON WITH MANUAL TARGETS ===")
    for category, target in manual_targets.items():
        extracted = category_counts.get(category, 0)
        difference = extracted - target
        print(f"  {category}: extracted {extracted}, target {target}, diff {difference:+}")

    # Story coverage analysis
    print(f"\n=== STORY COVERAGE ANALYSIS ===")
    story_counts = results_df['Story'].value_counts()
    print(f"Stories with similes: {len(story_counts)}/15")
    for story, count in story_counts.items():
        print(f"  {story}: {count} similes")

    # Save results
    filename = 'dubliners_rulebased_extraction.csv' # Changed filename
    results_df.to_csv(filename, index=False)
    print(f"\nResults saved to: {filename}")

    # Show sample results by category
    print(f"\n=== SAMPLE RESULTS BY CATEGORY ===")
    for category in sorted(results_df['Category (Framwrok)'].unique()):
        print(f"\n{category} Examples:")
        samples = results_df[results_df['Category (Framwrok)'] == category].head(2)
        for i, (_, row) in enumerate(samples.iterrows(), 1):
            print(f"  {i}. {row['ID']} ({row['Story']}):")
            print(f"     {row['Sentence Context'][:80]}...")
            print(f"     Comparator: {row['Comparator Type ']}")

    return results_df

def load_and_split_dubliners():
    """Load and split Dubliners text."""
    url = "https://www.gutenberg.org/files/2814/2814-0.txt"
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        text = response.text

        # Clean metadata
        start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK"
        end_marker = "*** END OF THE PROJECT GUTENBERG EBOOK"

        if start_marker in text:
            text = text.split(start_marker)[1]
        if end_marker in text:
            text = text.split(end_marker)[0]

        return text
    except Exception as e:
        print(f"Error loading text: {e}")
        return None

# Execute rule-based extraction
print("Starting rule-based Joyce simile extraction...") # Changed print statement
results = process_dubliners_rulebased() # Changed function call

if results is not None and len(results) > 0:
    print("\nRULE-BASED EXTRACTION COMPLETED") # Changed print statement
    print("Results should be much closer to your manual findings of 194 similes")
    print("CSV file automatically saved: dubliners_rulebased_extraction.csv") # Changed filename
    print("Ready for F1 analysis and comparison with manual annotations")

    # Display final summary
    print("\nFINAL SUMMARY FOR THESIS:")
    print("=" * 75)
    total_similes = len(results)
    print(f"Total similes identified: {total_similes:,}")
    print(f"Target from manual reading: 194")
    print(f"Accuracy: {(194/total_similes)*100:.1f}%" if total_similes > 0 else "N/A")

    # Category analysis
    category_counts = results['Category (Framwrok)'].value_counts()
    joycean_categories = [cat for cat in category_counts.index if 'Joycean' in cat]
    joycean_total = sum(category_counts.get(cat, 0) for cat in joycean_categories)

    print(f"Joycean innovations detected: {joycean_total}")
    print(f"Innovation percentage: {(joycean_total/total_similes)*100:.1f}%" if total_similes > 0 else "N/A")
    print(f"Stories analyzed: {results['Story'].nunique()}/15 stories")
    print("Ready for computational vs manual comparison")

    print("\nNext steps:")
    print("1. Load manual annotations: /content/All Similes - Dubliners cont(Sheet1).csv")
    print("2. Load BNC baseline: /content/concordance from BNC.csv")
    print("3. Run F1 score analysis comparing computational vs manual")
    print("4. Generate comprehensive visualizations")

else:
    print("Extraction failed - no results generated")

print("\nRULE-BASED EXTRACTION PIPELINE FINISHED") # Changed print statement
print("Check for the CSV file: dubliners_rulebased_extraction.csv") # Changed filename

SIMILE EXTRACTION ALGORITHM
Targeting manual reading findings: 194 total similes
- like: 91 instances
- as if: 38 instances
- Joycean_Silent: only 6 instances (2 colon, 2 en-dash, 2 ellipsis)
Starting rule-based Joyce simile extraction...

LOADING DUBLINERS TEXT
-------------------------
Downloaded 397,269 characters from Project Gutenberg

SPLITTING INTO STORIES
----------------------
Found THE SISTERS: 16,791 characters
Found AN ENCOUNTER: 17,443 characters
Found ARABY: 12,541 characters
Found EVELINE: 9,822 characters
Found AFTER THE RACE: 12,795 characters
Found TWO GALLANTS: 21,586 characters
Found THE BOARDING HOUSE: 15,300 characters
Found A LITTLE CLOUD: 27,891 characters
Found COUNTERPARTS: 22,658 characters
Found CLAY: 13,952 characters
Found A PAINFUL CASE: 20,572 characters
Found IVY DAY IN THE COMMITTEE ROOM: 29,147 characters
Found A MOTHER: 25,702 characters
Found GRACE: 43,126 characters
Found THE DEAD: 87,674 characters
Successfully found 15 stories

EXTRACTING SIMILES

# 5. Baseline Corpus Integration
# 5.1 British National Corpus Processing
The BNC concordance data provides essential baseline measurements for distinguishing literary innovation from standard English usage patterns.

# 5.2 Category Harmonization
Manual categorization data from the BNC is preserved while implementing algorithmic fallback classification to ensure comprehensive coverage and comparability across all datasets.

# 5.3 Statistical Foundation
The BNC baseline enables robust statistical testing including chi-square analysis, two-proportion tests, and binomial testing to quantify significance of observed differences.

In [None]:
# =============================================================================
# BNC BASELINE DATASET GENERATION
# Target: Load BNC data and classify similes into Standard and Quasi_Similes
# Purpose: Create a baseline for comparison with Dubliners similes
# =============================================================================

import pandas as pd
import re
import os

print("BNC BASELINE DATASET GENERATION")
print("Targeting Standard and Quasi_Similes classification")
print("=" * 65)

def load_and_process_bnc_data(bnc_path="concordance from BNC.csv"):
    """
    Load BNC concordance data and classify similes into Standard and Quasi_Similes.

    Prioritizes the 'Category (Framework)' column from the input CSV if available,
    falling back to algorithmic classification otherwise.

    Args:
        bnc_path (str): Path to the BNC concordance CSV file.

    Returns:
        pd.DataFrame: DataFrame with processed BNC data.
    """
    print(f"\nLoading BNC data from: {bnc_path}")

    if not os.path.exists(bnc_path):
        print(f"Error: BNC file not found at {bnc_path}")
        return pd.DataFrame()

    try:
        # Load the BNC data
        # Use robust loading for potentially complex CSV
        bnc_df = pd.read_csv(
            bnc_path,
            encoding='utf-8',
            quotechar='"',
            skipinitialspace=True,
            engine='python' # Use python engine for better handling of quotes/commas in text
        )
        print(f"Successfully loaded {len(bnc_df)} instances from BNC data.")
        print(f"Original columns: {list(bnc_df.columns)}")

    except Exception as e:
        print(f"Error loading BNC data: {e}")
        return pd.DataFrame()

    # Ensure required columns are present (Index, Left, Node, Right, Genre)
    required_cols = ['Index', 'Left', 'Node', 'Right', 'Genre']
    if not all(col in bnc_df.columns for col in required_cols):
        print(f"Error: BNC data is missing required concordance columns. Found: {list(bnc_df.columns)}")
        # Try alternative column names if common ones aren't found
        if 'Index' not in bnc_df.columns and 'index' in bnc_df.columns:
            bnc_df = bnc_df.rename(columns={'index': 'Index'})
        if 'Node' not in bnc_df.columns and 'node' in bnc_df.columns:
            bnc_df = bnc_df.rename(columns={'node': 'Node'})
        if 'Genre' not in bnc_df.columns and 'genre' in bnc_df.columns:
            bnc_df = bnc_df.rename(columns={'genre': 'Genre'})

        # Re-check after potential renaming
        if not all(col in bnc_df.columns for col in required_cols):
             print(f"Critical Error: Still missing required columns after attempting renaming. Found: {list(bnc_df.columns)}")
             return pd.DataFrame()


    # Reconstruct Sentence Context
    # Handle potential NaN values in Left, Node, Right
    bnc_df['Left'] = bnc_df['Left'].fillna('').astype(str)
    bnc_df['Node'] = bnc_df['Node'].fillna('').astype(str)
    bnc_df['Right'] = bnc_df['Right'].fillna('').astype(str)

    bnc_df['Sentence_Context'] = (bnc_df['Left'] + ' ' +
                                   bnc_df['Node'] + ' ' +
                                   bnc_df['Right']).str.strip()

    # Determine Comparator Type from Node
    bnc_df['Comparator_Type'] = bnc_df['Node'].str.lower()

    # Classify into Standard and Quasi_Similes
    # PRIORITIZE 'Category (Framework)' from input CSV if available and not null/empty
    manual_category_col = 'Category (Framework)'
    algorithmic_categories = []

    for index, row in bnc_df.iterrows():
        manual_category = row.get(manual_category_col)

        if pd.notna(manual_category) and str(manual_category).strip() != '':
            # Use the manual tag if it exists and is not empty
            category = str(manual_category).strip()
            # Standardize common variations if needed, e.g., 'Standard' instead of 'standard'
            if category.lower() == 'standard':
                 category = 'Standard'
            elif category.lower() == 'quasi_similes':
                 category = 'Quasi_Similes'
            # Keep other manual categories as they are if they exist (e.g. for error checking)
        else:
            # Fallback to algorithmic classification if manual tag is missing or empty
            node = str(row['Node']).lower()
            # Simple rule: 'like', 'as', 'as if' are Standard, others are Quasi_Similes
            if node in ['like', 'as', 'as if']:
                category = 'Standard'
            else:
                # Anything else in the 'Node' column will be treated as Quasi_Similes
                # based on the user's goal to have this category for comparison.
                category = 'Quasi_Similes'
            # Add a note if algorithmic classification was used as fallback
            bnc_df.loc[index, 'Additional_Notes'] = 'Algorithmically classified (no manual tag)'


        algorithmic_categories.append(category)

    bnc_df['Category_Framework'] = algorithmic_categories # Assign the determined category

    # Add Dataset_Source column
    bnc_df['Dataset_Source'] = 'BNC_Baseline'

    # Select and rename columns to match the standardized format used elsewhere
    # Include the original manual column for comparison if it exists
    output_cols = [
        'Index', 'Sentence_Context', 'Comparator_Type', 'Category_Framework',
        'Genre', 'Dataset_Source'
    ]
    if manual_category_col in bnc_df.columns:
        output_cols.insert(output_cols.index('Category_Framework') + 1, manual_category_col)
        # Rename manual column for clarity in output if it exists
        processed_bnc_df = bnc_df.rename(columns={manual_category_col: 'Original_Manual_Category'})
        output_cols = [col if col != manual_category_col else 'Original_Manual_Category' for col in output_cols]
    else:
        processed_bnc_df = bnc_df.copy()


    # Ensure all selected columns exist before slicing
    output_cols_present = [col for col in output_cols if col in processed_bnc_df.columns]
    processed_bnc_df = processed_bnc_df[output_cols_present]


    print(f"\nProcessed BNC data: {len(processed_bnc_df)} instances")
    print(f"Processed columns: {list(processed_bnc_df.columns)}")
    print(f"Category distribution (after prioritizing manual tags): {processed_bnc_df['Category_Framework'].value_counts().to_dict()}")
    if 'Original_Manual_Category' in processed_bnc_df.columns:
         print(f"Original Manual Category distribution: {processed_bnc_df['Original_Manual_Category'].value_counts().to_dict()}")


    # Save the processed data (optional, but good practice)
    output_filename = "bnc_processed_similes.csv"
    processed_bnc_df.to_csv(output_filename, index=False)
    print(f"Processed BNC data saved to: {output_filename}")

    return processed_bnc_df

# Execute the BNC data processing
print("Starting BNC data processing...")
bnc_processed_df = load_and_process_bnc_data()

if not bnc_processed_df.empty:
    print("\nBNC data processing completed successfully.")
    print("The 'bnc_processed_df' DataFrame is ready for comparative analysis.")
    # Display a sample
    print("\nSample of processed BNC data:")
    display(bnc_processed_df.head())
else:
    print("\nBNC data processing failed or resulted in an empty DataFrame.")

print("\nBNC BASELINE DATASET GENERATION FINISHED")

BNC BASELINE DATASET GENERATION
Targeting Standard and Quasi_Similes classification
Starting BNC data processing...

Loading BNC data from: concordance from BNC.csv
Successfully loaded 200 instances from BNC data.
Original columns: ['Index', 'Left', 'Node', 'Right', 'Genre', 'Comparator Type', 'Category (Framework)']

Processed BNC data: 200 instances
Processed columns: ['Index', 'Sentence_Context', 'Comparator_Type', 'Category_Framework', 'Original_Manual_Category', 'Genre', 'Dataset_Source']
Category distribution (after prioritizing manual tags): {'Standard': 124, 'Quasi_Simile': 76}
Original Manual Category distribution: {'Standard': 124, 'Quasi_Simile': 76}
Processed BNC data saved to: bnc_processed_similes.csv

BNC data processing completed successfully.
The 'bnc_processed_df' DataFrame is ready for comparative analysis.

Sample of processed BNC data:


Unnamed: 0,Index,Sentence_Context,Comparator_Type,Category_Framework,Original_Manual_Category,Genre,Dataset_Source
0,BNClab1,It seemed very much like she'd given up even ...,like,Standard,Standard,fiction,BNC_Baseline
1,BNClab2,Memories like this seem to pour out of her an...,like,Standard,Standard,fiction,BNC_Baseline
2,BNClab3,You sound like me.,like,Standard,Standard,fiction,BNC_Baseline
3,BNClab4,My love like a poultice drawing out that sweet...,like,Standard,Standard,fiction,BNC_Baseline
4,BNClab5,I went this far because my hour with Hannah ha...,like + like,Standard,Standard,fiction,BNC_Baseline



BNC BASELINE DATASET GENERATION FINISHED
