# EAMCET Zero Manual Pipeline - Google Colab

This notebook runs the fully automated EAMCET AI tutor training pipeline, which requires no manual annotations.

## Features:
- ✅ No manual annotations needed
- ✅ Intelligent pattern recognition
- ✅ Automatic question and answer extraction
- ✅ Automated model training
- ✅ Works with any EAMCET PDF format

## Step 1: Setup Environment

Install all dependencies and clone the repository.

In [None]:
# Install system dependencies
!apt-get update
!apt-get install -y tesseract-ocr tesseract-ocr-eng libtesseract-dev libgl1-mesa-glx libglib2.0-0

# Clone the repository
# If you are working from your own fork, replace the URL below with your fork's URL
!git clone https://github.com/jaganthoutam/EAMCET.git

# Change to the correct directory
%cd EAMCET

# Install Python dependencies
!pip install -r requirements.txt

# Fix PyMuPDF installation if needed
!pip uninstall fitz PyMuPDF -y 2>/dev/null || true
!pip install PyMuPDF==1.26.3

# Verify installation
import os  # needed for os.getcwd() below
import fitz
print("✅ PyMuPDF imported successfully")
print("✅ Environment ready!")
print(f"📁 Current directory: {os.getcwd()}")
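
Before moving on, it can help to confirm that the `tesseract` binary installed above is actually on the PATH. The following is a minimal sketch using only the standard library; `missing_tools` is a hypothetical helper name and is not part of the repository.

```python
import shutil

def missing_tools(names):
    """Return the command names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]

# "tesseract" is the executable provided by the tesseract-ocr package.
missing = missing_tools(["tesseract"])
if missing:
    print(f"Missing system tools: {', '.join(missing)}")
else:
    print("All required system tools found")
```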

## Step 2: Upload Your EAMCET PDFs

Upload your EAMCET PDF files to the data/raw_pdfs folder.

In [None]:
from google.colab import files
import os
from pathlib import Path

# Create data directory structure
os.makedirs("data/raw_pdfs", exist_ok=True)

# Upload files
uploaded = files.upload()

# Move uploaded PDFs to the data folder and report anything that was skipped
for filename in uploaded.keys():
    if filename.lower().endswith('.pdf'):
        os.rename(filename, f"data/raw_pdfs/{filename}")
        print(f"✅ Moved {filename} to data/raw_pdfs/")
    else:
        print(f"⚠️ Skipped {filename} (not a PDF)")

print(f"\n📁 Total PDFs in data/raw_pdfs: {len([f for f in os.listdir('data/raw_pdfs') if f.lower().endswith('.pdf')])}")
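
A `.pdf` extension does not guarantee that a file is a PDF. As a rough sanity check before running the pipeline, the sketch below tests whether each file in `data/raw_pdfs` starts with the `%PDF-` signature. `looks_like_pdf` is a hypothetical helper, not something the repository provides, and a matching header does not prove the rest of the file is intact.

```python
from pathlib import Path

def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

pdf_dir = Path("data/raw_pdfs")
if pdf_dir.is_dir():
    for pdf in sorted(pdf_dir.glob("*.pdf")):
        status = "ok" if looks_like_pdf(pdf) else "missing PDF header"
        print(f"{pdf.name}: {status}")
```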

## Step 3: Run Zero Manual Pipeline

Execute the fully automated extraction and training pipeline.

In [None]:
# Run the zero manual pipeline
!python eamcet_zero_manual_pipeline.py --input_folder data/raw_pdfs --output_folder colab_results

# These messages print even if the script above failed, so check its output for errors
print("\n✅ Pipeline run finished")
print("📁 Check the colab_results folder for outputs")
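
A message printed after a `!python` line appears whether or not the script succeeded. One way to report success only on a zero exit status is to launch the script with `subprocess` instead. This is a sketch, not the repository's own approach; `run_and_check` is a hypothetical helper, while the script name and flags are the ones used in the cell above.

```python
import subprocess
import sys
from pathlib import Path

def run_and_check(cmd):
    """Run a command and return True only if it exits with status 0."""
    result = subprocess.run(cmd)
    return result.returncode == 0

# Same script and flags as the cell above.
pipeline_cmd = [
    sys.executable, "eamcet_zero_manual_pipeline.py",
    "--input_folder", "data/raw_pdfs",
    "--output_folder", "colab_results",
]

if Path("eamcet_zero_manual_pipeline.py").exists():
    if run_and_check(pipeline_cmd):
        print("Pipeline completed")
    else:
        print("Pipeline failed; see the output above for details")
else:
    print("Pipeline script not found in the current directory")
```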

## Step 4: View Results

Examine the extracted data and training results.

In [None]:
import json
import os  # used in the fallback branch below
import pandas as pd
from pathlib import Path

# Load pipeline summary
if Path("colab_results/pipeline_summary.json").exists():
    with open("colab_results/pipeline_summary.json", "r") as f:
        summary = json.load(f)
    
    print("📊 PIPELINE SUMMARY:")
    print("=" * 40)
    
    print(f"Total questions extracted: {summary['extraction_stats']['total_questions']}")
    print(f"Questions with answers: {summary['extraction_stats']['paired_questions']}")
    
    print("\n📚 Subject Breakdown:")
    for subject, count in summary['extraction_stats']['subjects'].items():
        if count > 0:
            print(f"  {subject}: {count} questions")
    
    print("\n🎯 Training Data Created:")
    for data_type, count in summary['training_data_stats'].items():
        print(f"  {data_type}: {count} samples")
else:
    print("❌ Pipeline summary not found. Check if pipeline completed successfully.")
    print("📁 Available files in colab_results:")
    if os.path.exists("colab_results"):
        for file in os.listdir("colab_results"):
            print(f"  - {file}")
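
The cell above raises a `KeyError` if the summary file lacks any of the expected keys, for example after a partial run. The sketch below reads the same keys with `dict.get` so that missing sections are reported as `n/a` or skipped. `summary_lines` is a hypothetical helper, and the `example` dictionary is made-up data for illustration, not real pipeline output.

```python
def summary_lines(summary):
    """Build printable lines from a summary dict, tolerating missing keys."""
    stats = summary.get("extraction_stats", {})
    lines = [
        f"Total questions extracted: {stats.get('total_questions', 'n/a')}",
        f"Questions with answers: {stats.get('paired_questions', 'n/a')}",
    ]
    # Only list subjects that actually have questions, as the cell above does.
    for subject, count in stats.get("subjects", {}).items():
        if count > 0:
            lines.append(f"  {subject}: {count} questions")
    for data_type, count in summary.get("training_data_stats", {}).items():
        lines.append(f"  {data_type}: {count} samples")
    return lines

# Illustrative input only; "paired_questions" and "training_data_stats" are deliberately absent.
example = {
    "extraction_stats": {
        "total_questions": 3,
        "subjects": {"Physics": 3, "Chemistry": 0},
    }
}
for line in summary_lines(example):
    print(line)
```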

## Step 5: Download Results

Download the processed data and trained models.

In [None]:
from google.colab import files
import zipfile
import os

# Create zip file of results
if os.path.exists("colab_results"):
    with zipfile.ZipFile("eamcet_results.zip", "w") as zipf:
        for root, dirs, filenames in os.walk("colab_results"):
            for file in filenames:
                file_path = os.path.join(root, file)
                arcname = os.path.relpath(file_path, "colab_results")
                zipf.write(file_path, arcname)

    # Download the zip file
    files.download("eamcet_results.zip")
    print("✅ Results downloaded as eamcet_results.zip")
else:
    print("❌ No results found to download")
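
The manual `os.walk` loop above can also be written with `shutil.make_archive`, which stores paths relative to the folder being zipped. A minimal sketch, where `zip_folder` is a hypothetical helper name; the resulting archive would still be passed to `files.download` in Colab.

```python
import shutil
from pathlib import Path

def zip_folder(folder, archive_stem):
    """Zip the contents of `folder` into `<archive_stem>.zip`.

    Returns the archive path, or None if the folder does not exist.
    """
    if not Path(folder).is_dir():
        return None
    return shutil.make_archive(archive_stem, "zip", root_dir=folder)

archive = zip_folder("colab_results", "eamcet_results")
print(archive or "No results found to zip")
```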

## Troubleshooting

If you encounter issues, try these fixes:

In [None]:
# Troubleshooting cell - run this if you have issues
import os  # imported here so the cell also works after a kernel restart

# Check current directory
print(f"Current directory: {os.getcwd()}")
print(f"Directory contents: {os.listdir('.')}")

# Check whether we are already inside the repository or still need to enter it
if os.path.exists("eamcet_zero_manual_pipeline.py"):
    print("✅ Already inside the EAMCET directory")
elif os.path.exists("EAMCET"):
    print("✅ EAMCET directory found")
    %cd EAMCET
    print(f"Changed to: {os.getcwd()}")
else:
    print("❌ EAMCET directory not found")
    print("Available directories:")
    for item in os.listdir('.'):
        if os.path.isdir(item):
            print(f"  - {item}")

# Check PyMuPDF installation
try:
    import fitz
    print("✅ PyMuPDF is working")
except ImportError:
    print("❌ PyMuPDF not working, reinstalling...")
    !pip uninstall fitz PyMuPDF -y
    !pip install PyMuPDF==1.26.3
    import fitz
    print("✅ PyMuPDF reinstalled successfully")