# Edo-Meiji Polysemy Analysis - Google Colab Notebook

This notebook runs the complete Edo-Meiji semantic shift analysis pipeline.

## What This Notebook Does

1. **Setup**: Mount Google Drive and clone the repository
2. **Install Dependencies**: Install required Python packages
3. **Data Integration**: Automatically uses Meiji data downloaded via [meiji-download-to-drive.ipynb](https://github.com/jakalope/meiji-semantic-shift-analysis/blob/main/notebooks/meiji-download-to-drive.ipynb) if available
4. **Run Pipeline**: Execute the full analysis pipeline:
   - Preprocess texts (tokenization, normalization)
   - Extract BERT embeddings
   - Cluster embeddings and compute polysemy scores
   - Compare Edo vs. Meiji eras statistically
5. **View Results**: Display visualizations and statistical comparisons

## Data Sources

**Recommended workflow:**
1. First run [meiji-download-to-drive.ipynb](https://github.com/jakalope/meiji-semantic-shift-analysis/blob/main/notebooks/meiji-download-to-drive.ipynb) to download Meiji period texts from Aozora Bunko
2. Then run this notebook - it will automatically detect and use the downloaded data

**Fallback:** If you haven't run the download notebook, this notebook will use sample data instead.

## Requirements

- Google Account (for Drive access)
- Google Colab runtime (GPU recommended but not required)
- (Optional) Meiji data from meiji-download-to-drive.ipynb

## Estimated Runtime

- With sample data + GPU: ~5-10 minutes (first run may take longer for model downloads)
- With sample data + CPU: ~15-20 minutes
- With full downloaded data: Significantly longer (depends on data size)

---

## Step 1: Mount Google Drive

This will prompt you to authorize access to your Google Drive.
Your work will be saved in `/content/drive/MyDrive/Meiji_Semantic_Shift_Project/`.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

print("✓ Google Drive mounted successfully!")

## Step 2: Setup Project Directory

Create a dedicated folder in your Google Drive for this project.
This ensures all your work, intermediate files, and results are persisted.

In [None]:
import os

# Define project directory in Google Drive
PROJECT_DIR = "/content/drive/MyDrive/Meiji_Semantic_Shift_Project"

# Create directory if it doesn't exist
os.makedirs(PROJECT_DIR, exist_ok=True)
print(f"✓ Project directory: {PROJECT_DIR}")

# Change to project directory
%cd "{PROJECT_DIR}"
print(f"✓ Working directory: {os.getcwd()}")

## Step 3: Clone or Update Repository

This cell will:
- Clone the repository if it doesn't exist
- Pull latest changes if it already exists

In [None]:
import os

REPO_URL = "https://github.com/jakalope/meiji-semantic-shift-analysis.git"
REPO_NAME = "meiji-semantic-shift-analysis"
REPO_PATH = os.path.join(PROJECT_DIR, REPO_NAME)

if os.path.exists(REPO_PATH):
    print(f"Repository already exists at {REPO_PATH}")
    print("Pulling latest changes...")
    %cd "{REPO_PATH}"
    !git pull
    print("✓ Repository updated!")
else:
    print(f"Cloning repository to {REPO_PATH}...")
    # Clone into an explicit target so the cell does not depend on the current directory
    !git clone {REPO_URL} "{REPO_PATH}"
    %cd "{REPO_PATH}"
    print("✓ Repository cloned!")

# Verify we're in the repo
!pwd
!ls -la

### Colab-only: Install MeCab & IPADIC dictionary (run once per session)

In [None]:
# Install MeCab and IPADIC dictionary (system packages)
!apt-get update -qq
!apt-get install -y -qq mecab libmecab-dev mecab-ipadic-utf8

print("✓ MeCab system packages installed!")

In [None]:
# Install Python bindings for MeCab
!pip install mecab-python3

print("✓ mecab-python3 installed!")

In [None]:
# Verify MeCab installation
import MeCab

# Colab fix: explicitly use system mecabrc and ipadic-utf8 dictionary path
tagger = MeCab.Tagger('-r /etc/mecabrc -d /var/lib/mecab/dic/ipadic-utf8')
print(tagger.parse("明治時代の本は面白い").strip())

print("\n✓ MeCab is working correctly!")

**MeCab note for Colab**: The tagger is initialized with `-r /etc/mecabrc -d /var/lib/mecab/dic/ipadic-utf8` because the Python wrapper otherwise looks for `/usr/local/etc/mecabrc`, which does not exist on Colab.
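
As a minimal sketch of the same idea, the helper below builds the MeCab argument string only from paths that are actually present, so the same cell also works on a machine where the defaults are already correct. The function name `build_mecab_args` is hypothetical and not part of the repository; the two default paths are the ones used in the verification cell.

```python
import os


def build_mecab_args(rc_path="/etc/mecabrc",
                     dic_dir="/var/lib/mecab/dic/ipadic-utf8"):
    """Return a MeCab argument string that uses only the paths that exist.

    Hypothetical helper: falls back to MeCab's own defaults (empty string)
    when neither path is found.
    """
    args = []
    if os.path.isfile(rc_path):
        args.append(f"-r {rc_path}")   # explicit mecabrc location
    if os.path.isdir(dic_dir):
        args.append(f"-d {dic_dir}")   # explicit dictionary directory
    return " ".join(args)


# Usage in the notebook (requires mecab-python3):
#   import MeCab
#   tagger = MeCab.Tagger(build_mecab_args())
```

On Colab, after the system packages are installed, this should yield the same argument string as the verification cell.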

## Step 4: Install Dependencies

Install all required Python packages from requirements.txt.

**Note**: This may take a few minutes on first run.

In [None]:
# Install requirements
!pip install -q -r requirements.txt

print("\n" + "="*60)
print("✓ All dependencies installed!")
print("="*60)

## Step 5: Setup Python Path

Add the `src/` directory to Python path so we can import modules.
This fixes the import issue mentioned in the project documentation.

In [None]:
import sys
import os

# Get the repository root directory (assumes Step 3 left the working directory at the repo root)
REPO_ROOT = os.getcwd()
SRC_PATH = os.path.join(REPO_ROOT, 'src')

# Add src directory to Python path
if SRC_PATH not in sys.path:
    sys.path.insert(0, SRC_PATH)
    print(f"✓ Added {SRC_PATH} to Python path")

# Verify imports work
try:
    import utils
    import data_preprocess
    import embedding_extraction
    import polysemy_clustering
    import compare_eras
    print("✓ All modules can be imported successfully!")
except ImportError as e:
    print(f"❌ Import error: {e}")
    print("Please ensure all dependencies are installed.")

## Step 6: Check GPU Availability

Check if GPU is available for faster processing.

**Tip**: To enable GPU in Colab, go to `Runtime` > `Change runtime type` and select `GPU`. Changing the runtime type starts a fresh session, so re-run the cells from Step 1 afterwards.

In [None]:
import torch

if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✓ GPU is available: {torch.cuda.get_device_name(0)}")
    print(f"  GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
else:
    device = torch.device('cpu')
    print("ℹ GPU not available, using CPU")
    print("  Note: Processing will be slower. Consider enabling GPU in Runtime settings.")

print(f"\nUsing device: {device}")

## Step 7: Create Output Directories

Create directories for storing intermediate files and results.
These will be persisted in your Google Drive.

In [None]:
import os

# Define directories
DATA_PROCESSED = os.path.join(REPO_ROOT, 'data', 'processed')
DATA_EMBEDDINGS = os.path.join(REPO_ROOT, 'data', 'embeddings')
RESULTS_DIR = os.path.join(REPO_ROOT, 'results')
LOGS_DIR = os.path.join(REPO_ROOT, 'logs')

# Create directories
for directory in [DATA_PROCESSED, DATA_EMBEDDINGS, RESULTS_DIR, LOGS_DIR]:
    os.makedirs(directory, exist_ok=True)
    print(f"✓ {directory}")

print("\n✓ All directories created!")

---

# Running the Pipeline

Now we'll run the complete analysis pipeline, using the downloaded Meiji data if it is available and the sample data otherwise.
Each step is in a separate cell so you can see the progress.
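
Each pipeline cell below launches its script with `!python ...` and then prints a completion message, and that message appears even when the script exits with an error. If you want a failed step to stop the notebook instead, one option is a small wrapper around `subprocess.run`. This is a sketch only; `run_step` is a hypothetical helper, not something the repository provides.

```python
import subprocess
import sys


def run_step(script_args):
    """Run one pipeline script with the current interpreter.

    Prints the script's output and raises RuntimeError on a non-zero exit
    status. Note that output is shown only after the script finishes.
    """
    result = subprocess.run(
        [sys.executable, *script_args],
        capture_output=True,
        text=True,
    )
    print(result.stdout, end="")
    if result.returncode != 0:
        print(result.stderr, end="", file=sys.stderr)
        raise RuntimeError(
            f"Step failed with exit code {result.returncode}: {script_args[0]}"
        )
    return result.stdout


# Example, using the flags from Pipeline Step 4:
#   run_step(["src/compare_eras.py", "--input", "results",
#             "--output", "results", "--alpha", "0.05"])
```

The trade-off is that captured output is not streamed, so the `!python` form remains more convenient for watching long-running steps.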

---

## Pipeline Step 1: Preprocess Texts

Tokenize texts, extract word frequencies, and gather contexts for target words.

**Input**: 
- Edo texts from `data/samples/edo/`
- Meiji texts from either:
  - Google Drive (if you ran meiji-download-to-drive.ipynb): `/content/drive/MyDrive/meiji-semantic-data/meiji/`
  - Sample data (fallback): `data/samples/meiji/`

**Output**: 
- Word frequencies (CSV)
- Word contexts (JSON)

**Time**: ~30 seconds to 1 minute for sample data, longer for full downloaded data
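
To make the roles of `--top-n`, `--min-freq`, and `--max-contexts` concrete, here is a simplified sketch of frequency-based target selection and context collection on an already tokenized text. It is an illustration under assumptions, not the code in `src/data_preprocess.py`: the function names and the fixed token `window` are hypothetical, and the real script may filter by part of speech or use sentence-level contexts.

```python
from collections import Counter


def select_targets(tokens, top_n=20, min_freq=3):
    """Return up to top_n most frequent tokens that occur at least min_freq times."""
    counts = Counter(tokens)
    frequent = [word for word, n in counts.most_common() if n >= min_freq]
    return frequent[:top_n]


def collect_contexts(tokens, targets, window=5, max_contexts=100):
    """Map each target word to at most max_contexts token windows around it."""
    targets = set(targets)
    contexts = {word: [] for word in targets}
    for i, token in enumerate(tokens):
        if token in targets and len(contexts[token]) < max_contexts:
            start = max(0, i - window)
            contexts[token].append(tokens[start:i + window + 1])
    return contexts


# Tiny example with placeholder tokens
tokens = ["a", "b", "a", "c", "a", "b"]
targets = select_targets(tokens, top_n=2, min_freq=2)
contexts = collect_contexts(tokens, targets, window=1, max_contexts=2)
```

Raising `top_n` widens the vocabulary under study, while `max_contexts` caps how many occurrences per word are passed on to the embedding step.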

## Setup Meiji Data Path

This cell configures the path to Meiji period data.

**Two options:**

1. **Use downloaded data from Drive** (Recommended): If you've run the [meiji-download-to-drive.ipynb](https://github.com/jakalope/meiji-semantic-shift-analysis/blob/main/notebooks/meiji-download-to-drive.ipynb) notebook, it saves data to `/content/drive/MyDrive/meiji-semantic-data/meiji/`

2. **Use sample data**: Use the built-in sample data in `data/samples/meiji/`

The cell below will check if the Drive data exists and use it if available, otherwise fall back to sample data.

In [None]:
import os
from pathlib import Path

# Define the path where meiji-download-to-drive.ipynb saves data
MEIJI_DRIVE_DIR = Path('/content/drive/MyDrive/meiji-semantic-data/meiji')
MEIJI_DRIVE_FILE = MEIJI_DRIVE_DIR / 'meiji_aozora_combined.txt'

# Define local meiji data directory in the repo
MEIJI_LOCAL_DIR = Path(REPO_ROOT) / 'data' / 'meiji'
MEIJI_LOCAL_DIR.mkdir(parents=True, exist_ok=True)

# Check if Drive data exists
if MEIJI_DRIVE_FILE.exists():
    print(f"✓ Found Meiji data from meiji-download-to-drive.ipynb")
    print(f"  Location: {MEIJI_DRIVE_FILE}")
    
    # Copy the file to local data directory for processing
    import shutil
    local_meiji_file = MEIJI_LOCAL_DIR / 'meiji_aozora_combined.txt'
    
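    # Re-copy when the Drive file changed size, so a fresh download is not masked by a stale copy
    if (not local_meiji_file.exists()
            or local_meiji_file.stat().st_size != MEIJI_DRIVE_FILE.stat().st_size):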
        print(f"  Copying to {local_meiji_file}...")
        shutil.copy(MEIJI_DRIVE_FILE, local_meiji_file)
        print("  ✓ Copy complete!")
    else:
        print(f"  ✓ Already copied to {local_meiji_file}")
    
    # File size check
    file_size_mb = local_meiji_file.stat().st_size / (1024 * 1024)
    print(f"  File size: {file_size_mb:.2f} MB")
    
    MEIJI_DATA_DIR = str(MEIJI_LOCAL_DIR)
    USE_DOWNLOADED_DATA = True
else:
    print("ℹ Meiji data from Drive not found.")
    print(f"  Expected location: {MEIJI_DRIVE_FILE}")
    print("  Falling back to sample data...")
    
    MEIJI_DATA_DIR = os.path.join(REPO_ROOT, 'data', 'samples', 'meiji')
    USE_DOWNLOADED_DATA = False
    
    if os.path.exists(MEIJI_DATA_DIR):
        print(f"  ✓ Using sample data from {MEIJI_DATA_DIR}")
    else:
        print(f"  ⚠ Warning: Sample data directory not found at {MEIJI_DATA_DIR}")

print(f"\nMeiji data directory: {MEIJI_DATA_DIR}")
print(f"Using downloaded data: {USE_DOWNLOADED_DATA}")

In [None]:
%%time

print("="*60)
print("STEP 1: PREPROCESSING TEXTS")
print("="*60)
print()

# Use configured Meiji data path
EDO_DATA_DIR = os.path.join(REPO_ROOT, 'data', 'samples', 'edo')

print(f"Edo data directory:   {EDO_DATA_DIR}")
print(f"Meiji data directory: {MEIJI_DATA_DIR}")
print()

# Run preprocessing
!python src/data_preprocess.py \
    --edo-dir "{EDO_DATA_DIR}" \
    --meiji-dir "{MEIJI_DATA_DIR}" \
    --output data/processed \
    --top-n 20 \
    --min-freq 3 \
    --max-contexts 100

print("\n✓ Preprocessing complete!")
print(f"\nGenerated files:")
!ls -lh data/processed/

## Pipeline Step 2: Extract BERT Embeddings

Use Japanese BERT to generate contextual embeddings for each word occurrence.

**Model**: cl-tohoku/bert-base-japanese (~400MB)

**Note**: First run will download the model, which may take a few minutes.

**Time**: 
- First run with download: ~3-5 minutes
- Subsequent runs: ~1-2 minutes (GPU) or ~3-5 minutes (CPU)
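
BERT tokenizers split many words into several subword tokens, so a word occurrence usually has to be reduced from several token vectors to one. A common choice is to average them. The sketch below shows that pooling step in plain Python; it assumes mean pooling, which may differ from what `src/embedding_extraction.py` actually does, and `pool_word_vector` is a hypothetical name.

```python
def pool_word_vector(token_vectors, start, end):
    """Average the vectors of the subword tokens in [start, end) into one vector.

    token_vectors: one vector (list of floats) per token in the sentence.
    start, end: token indices covering the target word's subwords.
    """
    span = token_vectors[start:end]
    if not span:
        raise ValueError("empty token span")
    dim = len(span[0])
    return [sum(vec[d] for vec in span) / len(span) for d in range(dim)]


# Toy 2-dimensional example; a BERT base model produces 768 dimensions per token.
sentence_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
word_vector = pool_word_vector(sentence_vectors, 1, 3)
```

Each pooled vector represents one occurrence of a word in context, which is what the clustering step groups into senses.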

In [None]:
%%time

print("="*60)
print("STEP 2: EXTRACTING BERT EMBEDDINGS")
print("="*60)
print()

# Run embedding extraction
!python src/embedding_extraction.py \
    --input data/processed \
    --output data/embeddings \
    --model cl-tohoku/bert-base-japanese \
    --batch-size 16 \
    --device auto

print("\n✓ Embedding extraction complete!")
print(f"\nGenerated files:")
!ls -lh data/embeddings/

## Pipeline Step 3: Cluster Embeddings & Compute Polysemy

Cluster embeddings for each word to estimate number of distinct senses.
Calculate polysemy indices based on cluster count and quality.

**Method**: K-means clustering with silhouette score evaluation

**Time**: ~1-2 minutes
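
For readers unfamiliar with the silhouette score, the sketch below computes it in plain Python for a given cluster assignment: for each point, `a` is the mean distance to the other members of its own cluster, `b` is the mean distance to the nearest other cluster, and the score is `(b - a) / max(a, b)`. This is illustrative only. In practice a library routine such as scikit-learn's `sklearn.metrics.silhouette_score` would be used, and how the repository turns cluster count and quality into a polysemy index is not shown here.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (plain-Python sketch)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])   # matches the two visible groups
bad = silhouette_score(points, [0, 1, 0, 1])    # mixes the groups
```

A score near 1 indicates compact, well-separated clusters, while a score near or below 0 suggests the chosen number of clusters does not fit the embeddings.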

In [None]:
%%time

print("="*60)
print("STEP 3: CLUSTERING & POLYSEMY ANALYSIS")
print("="*60)
print()

# Run clustering and polysemy computation
!python src/polysemy_clustering.py \
    --input data/embeddings \
    --output results \
    --min-contexts 5

print("\n✓ Clustering and polysemy analysis complete!")
print(f"\nGenerated files:")
!ls -lh results/*.csv

## Pipeline Step 4: Compare Eras Statistically

Compare polysemy scores between Edo and Meiji periods using statistical tests.
Generate visualizations showing the comparison.

**Tests**: t-test and Mann-Whitney U test, with Cohen's d reported as the effect size

**Time**: ~30 seconds
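
As a sketch of the effect-size part, the snippet below computes Cohen's d from two lists of polysemy scores with the pooled sample standard deviation, and labels it with the same 0.2 / 0.5 / 0.8 thresholds used in the results cell. The sign convention (Meiji minus Edo) and the function names are assumptions, not taken from `src/compare_eras.py`. The p-values themselves would typically come from `scipy.stats.ttest_ind` and `scipy.stats.mannwhitneyu`.

```python
import math
import statistics


def cohens_d(edo_scores, meiji_scores):
    """Cohen's d (Meiji minus Edo) using the pooled sample standard deviation."""
    n1, n2 = len(edo_scores), len(meiji_scores)
    v1 = statistics.variance(edo_scores)
    v2 = statistics.variance(meiji_scores)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(meiji_scores) - statistics.mean(edo_scores)) / pooled_sd


def effect_label(d):
    """Label |d| with the conventional thresholds 0.2, 0.5 and 0.8."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


d = cohens_d([1, 2, 3], [2, 3, 4])
label = effect_label(d)
```

Reporting the effect size alongside the p-values matters with small samples, where a sizeable difference can still fail to reach significance.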

In [None]:
%%time

print("="*60)
print("STEP 4: ERA COMPARISON")
print("="*60)
print()

# Run statistical comparison
!python src/compare_eras.py \
    --input results \
    --output results \
    --alpha 0.05

print("\n✓ Era comparison complete!")
print(f"\nGenerated files:")
!ls -lh results/

---

# Viewing Results

Now let's examine the results of our analysis.

---

## Statistical Comparison Results

Load and display the statistical comparison between Edo and Meiji eras.

In [None]:
import json
import pandas as pd

# Load statistical comparison
with open('results/statistical_comparison.json', 'r', encoding='utf-8') as f:
    stats = json.load(f)

print("="*60)
print("STATISTICAL COMPARISON: EDO VS MEIJI")
print("="*60)
print()

print(f"Sample Sizes:")
print(f"  Edo period:   {stats['edo_count']} words")
print(f"  Meiji period: {stats['meiji_count']} words")
print()

print(f"Mean Polysemy Index:")
print(f"  Edo period:   {stats['edo_mean']:.3f}")
print(f"  Meiji period: {stats['meiji_mean']:.3f}")
print(f"  Difference:   {stats['mean_difference']:.3f}")
print()

print(f"Statistical Tests:")
print(f"  T-test p-value:           {stats['ttest_pvalue']:.4f}")
print(f"  Mann-Whitney p-value:     {stats['mannwhitney_pvalue']:.4f}")
print(f"  Cohen's d (effect size):  {stats['cohens_d']:.3f}")
print()

# Interpret significance (t-test p-value; alpha should match --alpha in Pipeline Step 4)
alpha = 0.05
if stats['ttest_pvalue'] < alpha:
    print(f"✓ T-test difference is statistically significant (p < {alpha})")
else:
    print(f"  T-test difference is not statistically significant (p >= {alpha})")

# Interpret effect size
d = abs(stats['cohens_d'])
if d < 0.2:
    effect = "negligible"
elif d < 0.5:
    effect = "small"
elif d < 0.8:
    effect = "medium"
else:
    effect = "large"

print(f"  Effect size: {effect}")
print()

## Word-Level Comparison

View polysemy changes for individual words.

In [None]:
import pandas as pd

# Load word-level comparison
word_comparison = pd.read_csv('results/word_level_comparison.csv')

print("="*60)
print("WORD-LEVEL POLYSEMY CHANGES")
print("="*60)
print()

# Sort by absolute change
word_comparison['abs_change'] = word_comparison['polysemy_change'].abs()
word_comparison_sorted = word_comparison.sort_values('abs_change', ascending=False)

print("Top words with largest polysemy changes:")
print()
display(word_comparison_sorted.head(10)[['word', 'edo_polysemy', 'meiji_polysemy', 'polysemy_change']])

print("\nWords with increased polysemy (Meiji > Edo):")
increased = word_comparison[word_comparison['polysemy_change'] > 0].sort_values('polysemy_change', ascending=False)
if len(increased) > 0:
    display(increased.head(5)[['word', 'edo_polysemy', 'meiji_polysemy', 'polysemy_change']])
else:
    print("  No words with increased polysemy")

print("\nWords with decreased polysemy (Edo > Meiji):")
decreased = word_comparison[word_comparison['polysemy_change'] < 0].sort_values('polysemy_change')
if len(decreased) > 0:
    display(decreased.head(5)[['word', 'edo_polysemy', 'meiji_polysemy', 'polysemy_change']])
else:
    print("  No words with decreased polysemy")

## Visualizations

Display the generated plots comparing Edo and Meiji polysemy.

In [None]:
from IPython.display import Image, display
import os

# List of plot files to display
plot_files = [
    ('results/polysemy_distribution.png', 'Polysemy Distribution Comparison'),
    ('results/polysemy_boxplot.png', 'Polysemy Box Plot'),
    ('results/cluster_comparison.png', 'Cluster Count Comparison'),
    ('results/top_polysemy_changes.png', 'Top Polysemy Changes')
]

print("="*60)
print("VISUALIZATIONS")
print("="*60)
print()

for filepath, title in plot_files:
    if os.path.exists(filepath):
        print(f"\n{title}:")
        print("-" * len(title))
        display(Image(filename=filepath))
    else:
        print(f"\n⚠ {title}: File not found at {filepath}")

---

## Summary

✓ **Pipeline completed successfully!**

### What we did:

1. ✓ Configured data sources (using downloaded Meiji data if available, or sample data as fallback)
2. ✓ Preprocessed texts from Edo and Meiji periods
3. ✓ Extracted contextual BERT embeddings for target words
4. ✓ Clustered embeddings to estimate word polysemy
5. ✓ Compared polysemy distributions statistically
6. ✓ Generated visualizations and reports

### Output files (saved in Google Drive):

**Processed Data:**
- `data/processed/edo_contexts.json` - Extracted Edo word contexts
- `data/processed/meiji_contexts.json` - Extracted Meiji word contexts
- `data/processed/*_word_frequencies.csv` - Word frequency tables

**Embeddings:**
- `data/embeddings/edo_embeddings.pkl` - Edo BERT embeddings
- `data/embeddings/meiji_embeddings.pkl` - Meiji BERT embeddings

**Results:**
- `results/statistical_comparison.json` - Statistical test results
- `results/word_level_comparison.csv` - Per-word polysemy comparison
- `results/*.png` - Visualization plots

### Next Steps:

1. **Get more data**: If you used sample data, run [meiji-download-to-drive.ipynb](https://github.com/jakalope/meiji-semantic-shift-analysis/blob/main/notebooks/meiji-download-to-drive.ipynb) to download full Meiji period texts from Aozora Bunko, then re-run this notebook
2. **Explore results**: Use the exploratory analysis notebook for deeper investigation
3. **Customize analysis**: Modify parameters in the pipeline cells above
4. **Export findings**: Download results from your Google Drive

### For Full Analysis:

**Meiji data:**
1. Run [meiji-download-to-drive.ipynb](https://github.com/jakalope/meiji-semantic-shift-analysis/blob/main/notebooks/meiji-download-to-drive.ipynb) to download ~32M characters of Meiji texts
2. Re-run this notebook - it will automatically detect and use the downloaded data

**Edo data:**
1. Upload Edo period texts to `data/edo/` in your Google Drive project folder
2. Point the `EDO_DATA_DIR` variable in the preprocessing cell at that folder (it defaults to `data/samples/edo`)
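
One way to avoid editing the cell by hand is to pick the Edo directory the same way the notebook picks the Meiji one: prefer the full data when it is present and fall back to the samples otherwise. The helper below is a sketch; `choose_data_dir` is a hypothetical name, and the `*.txt` pattern is an assumption about how the Edo texts are stored.

```python
from pathlib import Path


def choose_data_dir(preferred, fallback, pattern="*.txt"):
    """Return preferred if it contains files matching pattern, else fallback."""
    preferred = Path(preferred)
    if preferred.is_dir() and any(preferred.glob(pattern)):
        return str(preferred)
    return str(fallback)


# In the preprocessing cell (REPO_ROOT is defined in Step 5):
#   EDO_DATA_DIR = choose_data_dir(
#       os.path.join(REPO_ROOT, 'data', 'edo'),
#       os.path.join(REPO_ROOT, 'data', 'samples', 'edo'),
#   )
```

With this in place, uploading texts to `data/edo/` is enough for the next run to use them.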

**Adjust parameters:**
- Increase `--top-n` to analyze more words
- Adjust `--min-freq` and `--max-contexts` as needed

---

## References

- **Repository**: [github.com/jakalope/meiji-semantic-shift-analysis](https://github.com/jakalope/meiji-semantic-shift-analysis)
- **Meiji Data Download**: [meiji-download-to-drive.ipynb](https://github.com/jakalope/meiji-semantic-shift-analysis/blob/main/notebooks/meiji-download-to-drive.ipynb)
- **BERT Model**: [cl-tohoku/bert-base-japanese](https://huggingface.co/cl-tohoku/bert-base-japanese)
- **Text Source**: [Aozora Bunko](https://www.aozora.gr.jp/)

---

## Troubleshooting

### Common Issues:

**1. Out of Memory Error**
- Solution: Reduce `--batch-size` in embedding extraction cell (try 8 or 4)
- Or: Use CPU instead of GPU by changing `--device auto` to `--device cpu`

**2. MeCab Import Error**
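- Solution: Re-run the three MeCab cells after Step 3 (system packages, `!pip install mecab-python3`, and the verification cell)
- Note: If the tagger cannot find its configuration, pass `-r /etc/mecabrc -d /var/lib/mecab/dic/ipadic-utf8` explicitly, as the verification cell does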

**3. Model Download Fails**
- Solution: Check internet connection and try running the embedding cell again
- The model is cached after first successful download

**4. "Module not found" Error**
- Solution: Re-run the "Setup Python Path" cell (Step 5)
- Make sure you've run all cells in order

**5. Empty Results**
- Check: Are there sample text files in `data/samples/edo/` and `data/samples/meiji/`?
- Solution: Verify with `!ls -la data/samples/edo/` and `!ls -la data/samples/meiji/`

### Need Help?

- Check the [GitHub repository](https://github.com/jakalope/meiji-semantic-shift-analysis) for issues
- Read the [USAGE.md](https://github.com/jakalope/meiji-semantic-shift-analysis/blob/main/USAGE.md) guide
- Open a new issue with your error message and system info

---