# Computational Analysis of Simile Structures in Joyce's Dubliners
## Notebook 1: Data Processing and Linguistic Analysis

**Master's Dissertation Research - University College London**

This notebook implements the data processing pipeline for a computational analysis of simile structures across three datasets: manual annotations, computational extractions, and a baseline sample from the British National Corpus (BNC). The analysis extends the theoretical framework of Leech & Short (1981) with novel Joycean categories.

**Repository:** https://github.com/[username]/joyce-dubliners-similes-analysis

## Colab Setup and GitHub Integration

In [None]:
# Mount Google Drive for file access (if using Colab)
try:
    from google.colab import drive
    drive.mount('/content/drive')
    print("Google Drive mounted successfully")
    
    # Set working directory to your project folder
    import os
    os.chdir('/content/drive/MyDrive/joyce-dubliners-similes-analysis')
    print("Working directory set to project folder")
    
except ImportError:
    print("Not running in Colab - skipping Drive mount")

# For GitHub integration (run once to get initial files)
# Uncomment the following lines on first run:
# !git clone https://github.com/[username]/joyce-dubliners-similes-analysis.git
# %cd joyce-dubliners-similes-analysis

# For subsequent runs, pull latest changes:
# !git pull origin main

In [None]:
# Install required packages
!pip install spacy textblob scikit-learn -q
!python -m spacy download en_core_web_lg -q

# Verify file structure
import os
print("Current working directory:", os.getcwd())
print("\nProject files:")
for file in os.listdir('.'):
    if file.endswith(('.csv', '.py', '.ipynb')):
        print(f"  ✓ {file}")

## Core Imports and Configuration

In [None]:
import pandas as pd
import numpy as np
import spacy
import logging
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import warnings
from pathlib import Path
import requests
from typing import Dict, List, Tuple, Optional

# Log to both a file (simile_analysis.log) and the notebook output
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
        logging.FileHandler('simile_analysis.log'),
        logging.StreamHandler()
    ]
)
logger = logging.getLogger(__name__)

# Suppress non-critical warnings for cleaner output
warnings.filterwarnings('ignore', category=UserWarning)

# Initialize spaCy language model for linguistic processing
try:
    nlp = spacy.load("en_core_web_lg")
    logger.info("SpaCy English language model initialized successfully")
except OSError:
    logger.error("SpaCy model not found. Install with: python -m spacy download en_core_web_lg")
    raise

## Dataset Loading and Standardization Module

In [None]:
class SimileDataProcessor:
    """
    Handles loading, cleaning, and standardization of simile datasets for
    comparative analysis. Implements consistent data structures across
    manual annotations, computational extractions, and BNC baseline corpus.
    """
    
    def __init__(self):
        """Initialize the data processor with configuration parameters."""
        self.nlp = nlp
        self.processed_datasets = {}
        self.dataset_metadata = {}
        
        # Define theoretical framework categories for consistency
        self.theoretical_categories = {
            'standard': ['Standard', 'standard'],
            'quasi': ['Quasi', 'Joycean_Quasi', 'quasi'],
            'joycean_silent': ['Joycean_Silent', 'joycean_silent'],
            'joycean_hybrid': ['Joycean_Hybrid', 'Joycean_Quasi_Fuzzy', 'joycean_hybrid'],
            'joycean_framed': ['Joycean_Framed', 'joycean_framed'],
            'joycean_complex': ['Joycean', 'joycean']
        }
        
        logger.info("SimileDataProcessor initialized with theoretical framework categories")

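    # Remaining SimileDataProcessor methods. They are kept in the same cell as
    # the class statement: an indented def at the top of a separate cell raises
    # IndentationError and would not be attached to the class.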
    def load_manual_annotations(self, filepath: str) -> pd.DataFrame:
        """
        Load and process manual annotations from close reading analysis.
        
        This dataset serves as the gold standard for algorithmic validation,
        containing expert-annotated similes with theoretical categorization
        based on extended Leech & Short framework.
        
        Args:
            filepath (str): Path to manual annotations CSV file
            
        Returns:
            pd.DataFrame: Processed manual annotations dataset
        """
        logger.info("Loading manual annotations dataset")
        
        try:
            # cp1252 (Windows-1252) is typical of CSV files exported from Excel
            # on Windows; change this if the file was saved as UTF-8
            manual_df = pd.read_csv(filepath, encoding='cp1252')
            
            # Standardize column names for consistency
            column_mapping = {
                'Category (Framwrok)': 'Category_Framework',  # Fix typo in original
                'Comparator Type ': 'Comparator_Type',  # Remove trailing space
                'Sentence Context': 'Sentence_Context',
                'Additional Notes': 'Additional_Notes',
                'Page No.': 'Page_Number'
            }
            manual_df = manual_df.rename(columns=column_mapping)
            
            # Clean and validate data
            manual_df = self._clean_manual_annotations(manual_df)
            
            # Add dataset identifier
            manual_df['Dataset_Source'] = 'Manual_Annotation'
            manual_df['Analysis_Method'] = 'Close_Reading'
            
            # Store metadata
            self.dataset_metadata['manual'] = {
                'total_instances': len(manual_df),
                'stories_covered': manual_df['Story'].nunique(),
                'categories_found': manual_df['Category_Framework'].nunique(),
                'date_processed': pd.Timestamp.now()
            }
            
            logger.info(f"Manual annotations loaded: {len(manual_df)} instances across {manual_df['Story'].nunique()} stories")
            
            return manual_df
            
        except Exception as e:
            logger.error(f"Failed to load manual annotations: {str(e)}")
            raise
    
    def _clean_manual_annotations(self, df: pd.DataFrame) -> pd.DataFrame:
        """Drop rows lacking sentence text or a category, then trim the text."""
        # Remove rows with no sentence text or no category. The explicit copy
        # avoids SettingWithCopyWarning on the assignment below.
        df = df.dropna(subset=['Sentence_Context', 'Category_Framework']).copy()

        # Strip leading and trailing whitespace from the sentence text
        df['Sentence_Context'] = df['Sentence_Context'].str.strip()

        return df
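
The `theoretical_categories` mapping in `SimileDataProcessor` lists the label variants for each category, but the cells above do not yet apply it. A minimal sketch of one way to use it follows, assuming that matching should ignore case and surrounding whitespace. The names `build_label_lookup` and `normalize_category` are hypothetical and not part of the notebook.

```python
from typing import Dict, List, Optional

# Copied from SimileDataProcessor.theoretical_categories above.
THEORETICAL_CATEGORIES: Dict[str, List[str]] = {
    'standard': ['Standard', 'standard'],
    'quasi': ['Quasi', 'Joycean_Quasi', 'quasi'],
    'joycean_silent': ['Joycean_Silent', 'joycean_silent'],
    'joycean_hybrid': ['Joycean_Hybrid', 'Joycean_Quasi_Fuzzy', 'joycean_hybrid'],
    'joycean_framed': ['Joycean_Framed', 'joycean_framed'],
    'joycean_complex': ['Joycean', 'joycean'],
}


def build_label_lookup(categories: Dict[str, List[str]]) -> Dict[str, str]:
    """Invert the mapping: lower-cased variant label -> canonical key."""
    lookup: Dict[str, str] = {}
    for canonical, variants in categories.items():
        for variant in variants:
            lookup[variant.strip().lower()] = canonical
    return lookup


LABEL_LOOKUP = build_label_lookup(THEORETICAL_CATEGORIES)


def normalize_category(raw_label: Optional[str]) -> Optional[str]:
    """Return the canonical key for a raw label, or None if it is unknown.

    Assumption (not from the notebook): labels are compared after
    stripping whitespace and lower-casing.
    """
    if raw_label is None:
        return None
    return LABEL_LOOKUP.get(str(raw_label).strip().lower())
```

On the loaded frame this could be applied as `manual_df['Category_Framework'].map(normalize_category)`. Rows that come back as `None` carry labels outside the framework and are worth inspecting by hand.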

## File Verification and Main Processing

In [None]:
# Verify input data files are available (only the 2 we need to upload)
required_input_files = [
    'All Similes  Dubliners contSheet1.csv',  # Manual annotations
    'concordance from BNC.csv'                 # BNC baseline data
]

missing_files = []
for file in required_input_files:
    if not os.path.exists(file):
        missing_files.append(file)

if missing_files:
    print("WARNING: Missing required input data files:")
    for file in missing_files:
        print(f"  ❌ {file}")
    print("\nPlease upload these files to your repository:")
    print("  1. Your manual annotations CSV")
    print("  2. Your BNC concordance CSV")
    print("\nThe third dataset (computational extractions) will be generated by this notebook.")
else:
    print("All required input files found ✓")
    print("  ✓ Manual annotations ready")
    print("  ✓ BNC baseline data ready") 
    print("  → Computational extractions will be generated")
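
The check above confirms only that the files exist. Because `load_manual_annotations` reads the CSV with a hard-coded `cp1252` encoding, it may also help to confirm which encoding a file actually decodes under. The sketch below is one possible approach, not part of the notebook. The helper name `detect_encoding` and the candidate order are assumptions.

```python
from pathlib import Path
from typing import Optional, Sequence


def detect_encoding(
    path: str,
    candidates: Sequence[str] = ("utf-8-sig", "cp1252"),
) -> Optional[str]:
    """Return the first candidate encoding that decodes the whole file.

    "utf-8-sig" is tried first because it accepts UTF-8 with or without
    a byte-order mark and rejects most cp1252-encoded accented text.
    Returns None if no candidate decodes the file cleanly.
    """
    raw_bytes = Path(path).read_bytes()
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None
```

The result can be passed straight to pandas, for example `pd.read_csv(path, encoding=detect_encoding(path))`. A clean decode does not prove the encoding is the right one, so it is still worth spot-checking a few rows with accented characters or curly quotes.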

## Quick Test - Load Your Manual Annotations

In [None]:
# Quick test to load and examine your manual annotations
if os.path.exists('All Similes  Dubliners contSheet1.csv'):
    try:
        # Initialize processor
        processor = SimileDataProcessor()
        
        # Load manual annotations
        manual_data = processor.load_manual_annotations('All Similes  Dubliners contSheet1.csv')
        
        print("\n" + "="*60)
        print("MANUAL ANNOTATIONS DATASET SUMMARY")
        print("="*60)
        
        print(f"Total similes: {len(manual_data)}")
        print(f"Stories covered: {manual_data['Story'].nunique()}")
        print(f"Categories found: {manual_data['Category_Framework'].nunique()}")
        
        print("\nCategory distribution:")
        category_counts = manual_data['Category_Framework'].value_counts()
        for category, count in category_counts.items():
            percentage = (count / len(manual_data)) * 100
            print(f"  {category}: {count} ({percentage:.1f}%)")
        
        print("\nSample entries:")
        for i, (_, row) in enumerate(manual_data.head(3).iterrows()):
            print(f"\n{i+1}. {row['ID']} ({row['Story']})")
            print(f"   Text: {row['Sentence_Context'][:80]}...")
            print(f"   Category: {row['Category_Framework']}")
            print(f"   Comparator: {row['Comparator_Type']}")
        
        print("\n✅ Manual annotations loaded successfully!")
        
    except Exception as e:
        print(f"❌ Error loading manual annotations: {e}")
else:
    print("❌ Manual annotations file not found")
    print("Please upload 'All Similes  Dubliners contSheet1.csv' to your repository")
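
The quick test reads the columns `ID`, `Story`, `Sentence_Context`, `Category_Framework`, and `Comparator_Type`. If a header in the CSV differs from the keys in `column_mapping`, the rename silently skips it and the failure only appears later as a `KeyError`. A small sketch of an explicit check follows. The helper name `find_missing_columns` is hypothetical.

```python
from typing import Iterable, List

# Columns the quick-test cell reads from the manual annotations frame.
REQUIRED_COLUMNS = [
    "ID",
    "Story",
    "Sentence_Context",
    "Category_Framework",
    "Comparator_Type",
]


def find_missing_columns(
    columns: Iterable[str],
    required: Iterable[str] = REQUIRED_COLUMNS,
) -> List[str]:
    """Return the required column names absent from `columns`.

    Matching is exact, so a header with a stray trailing space is
    reported as missing rather than silently accepted.
    """
    present = set(columns)
    return [name for name in required if name not in present]
```

Calling `find_missing_columns(manual_data.columns)` before printing the summary gives a readable list of what needs fixing in the CSV headers.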

## Upload Instructions

**To complete the setup:**

1. **Upload your 2 CSV files** to this Colab environment or your GitHub repository:
   - `All Similes  Dubliners contSheet1.csv` (your manual annotations)
   - `concordance from BNC.csv` (your BNC baseline data)

2. **Re-run the cells above** to verify the files are loaded correctly

3. **Continue with the full processing pipeline** once both files are available

The notebook will automatically:
- Download Dubliners from Project Gutenberg
- Run computational simile extraction 
- Process all three datasets with linguistic analysis
- Export processed datasets for Notebook 2
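
As a rough illustration of the first step in that list, the sketch below fetches a plain-text file and trims the Project Gutenberg licence header and footer. It is not the notebook's own implementation. The URL (Dubliners as ebook #2814), the helper names, and the use of the standard library in place of `requests` are all assumptions to verify before use.

```python
import re
from urllib.request import urlopen

# Assumed location of the plain-text edition; check it before relying on it.
DUBLINERS_URL = "https://www.gutenberg.org/files/2814/2814-0.txt"

# Project Gutenberg wraps the text between "*** START OF ..." and
# "*** END OF ..." marker lines.
START_RE = re.compile(
    r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK[^\n]*\n"
)
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK")


def strip_gutenberg_boilerplate(raw_text: str) -> str:
    """Return the text between the START and END markers.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(raw_text)
    begin = start.end() if start else 0
    end = END_RE.search(raw_text, begin)
    finish = end.start() if end else len(raw_text)
    return raw_text[begin:finish].strip()


def download_text(url: str = DUBLINERS_URL, timeout: float = 30.0) -> str:
    """Download a UTF-8 text file, dropping a byte-order mark if present."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8-sig")
```

A typical call would be `dubliners_text = strip_gutenberg_boilerplate(download_text())`. Saving the result to disk avoids repeated downloads when the notebook is re-run.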