# TN PDS Crawler - Google Colab Runner

This notebook runs the Tamil Nadu PDS Crawler in Google Colab.

## Features
- Automatic setup of Chrome and ChromeDriver
- Runs the crawler in headless mode
- Saves results to Google Drive (optional)
- Downloads results to your local machine

## 1. Setup Environment

First, let's install the required dependencies.

In [None]:
# Install required packages
!pip install selenium webdriver-manager flask requests python-dotenv
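
After the install finishes, it can help to confirm that the packages are actually importable before going further. The following is a minimal sketch, not part of the crawler: the helper name `missing_modules` is hypothetical, and the import names in `required` are the ones assumed to correspond to the pip packages above (`webdriver-manager` imports as `webdriver_manager`, `python-dotenv` as `dotenv`).

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names assumed to match the packages installed above
required = ["selenium", "webdriver_manager", "flask", "requests", "dotenv"]
missing = missing_modules(required)
print("Missing modules:", ", ".join(missing) if missing else "none")
```

If anything is reported as missing, re-run the install cell and check its output for errors.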

## 2. Get the Code

You can either clone the repository (if it's public) or upload the necessary files.

In [None]:
# Option 1: Clone the repository (if it's public)
!git clone https://github.com/gunaseelan13/tn-pds-crawler.git
%cd tn-pds-crawler

In [None]:
# Option 2: Upload files directly
# Uncomment and run this cell if you prefer to upload files instead of cloning

# from google.colab import files
# print("Please upload the crawai_pds_selenium.py file:")
# uploaded = files.upload()
# print("Please upload the shop_list.json file:")
# uploaded = files.upload()
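
Whichever option you use, the later cells expect `crawai_pds_selenium.py` and `shop_list.json` to be in the current working directory. A small check along these lines can catch a missing file early; the helper name `missing_files` is hypothetical and only illustrates the idea.

```python
from pathlib import Path


def missing_files(names, base_dir="."):
    """Return the entries of `names` that are not regular files in `base_dir`."""
    return [name for name in names if not (Path(base_dir) / name).is_file()]


# File names taken from the cells above
required_files = ["crawai_pds_selenium.py", "shop_list.json"]
missing = missing_files(required_files)
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All required files are present")
```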

## 3. Setup Chrome and ChromeDriver

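Colab runtimes do not reliably include a browser, so this step installs Chromium and prints its version and location. On some Ubuntu releases the `chromium-browser` apt package is only a wrapper around the snap package, which may not run in Colab; if the version check below fails, a different Chromium or Chrome package is needed.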

In [None]:
# Install Chromium, then print its version and location
!apt-get update
!apt-get install -y chromium-browser
!chromium-browser --version
!which chromium-browser
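
The same check can be done from Python, which is convenient if a later cell needs the browser path. This is a sketch under assumptions: the helper name `find_browser` is hypothetical, and the candidate names other than `chromium-browser` are common binary names that may or may not exist on a given runtime.

```python
import shutil


def find_browser(candidates=("chromium-browser", "chromium", "google-chrome")):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


browser_path = find_browser()
print("Browser binary:", browser_path or "not found")
```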

## 4. Create Directories

Create necessary directories for output files.

In [None]:
# Create directories for output
!mkdir -p data

# Get current date for filename
import datetime
current_date = datetime.datetime.now().strftime("%Y%m%d")
output_filename = f"data/shop_status_results_{current_date}.json"
print(f"Results will be saved to: {output_filename}")

## 5. Modify Crawler for Colab (Optional)

Optionally, patch the crawler script so that Chrome starts more reliably in the Colab environment.

In [None]:
# This cell is optional - it adds some Colab-specific Chrome options to the crawler
# To apply the changes, remove the ''' lines around the code below and run the cell

'''
import fileinput

# Add Colab-specific Chrome options right after the --headless option
with fileinput.FileInput("crawai_pds_selenium.py", inplace=True) as file:
    for line in file:
        print(line, end='')
        if "chrome_options.add_argument(\"--headless\")" in line:
            # Reuse the matched line's indentation so the inserted lines stay valid Python
            indent = line[:len(line) - len(line.lstrip())]
            print(f"{indent}chrome_options.add_argument(\"--disable-dev-shm-usage\")  # Overcome limited resource problems in Colab")
            print(f"{indent}chrome_options.add_argument(\"--no-sandbox\")  # Required in Colab")
'''
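
The in-place patch above inserts the extra options every time it runs. As an alternative sketch, the same edit can be written as a pure function on the script text, which makes it easy to test and safe to apply twice. The function name `add_colab_flags` is hypothetical; the marker line and the `chrome_options` variable name are taken from the cell above and are assumed to match the crawler script.

```python
def add_colab_flags(source,
                    marker='chrome_options.add_argument("--headless")',
                    flags=("--disable-dev-shm-usage", "--no-sandbox")):
    """Return `source` with each missing flag added after the marker line."""
    # Skip flags the script already sets, so re-running changes nothing
    pending = [flag for flag in flags if f'"{flag}"' not in source]
    if not pending:
        return source
    out = []
    for line in source.splitlines(keepends=True):
        if marker in line and not line.endswith("\n"):
            line += "\n"  # marker was the last line without a newline
        out.append(line)
        if marker in line:
            # Match the indentation of the marker line
            indent = line[:len(line) - len(line.lstrip())]
            for flag in pending:
                out.append(f'{indent}chrome_options.add_argument("{flag}")\n')
    return "".join(out)


sample = 'def setup():\n    chrome_options.add_argument("--headless")\n    start()\n'
print(add_colab_flags(sample))
```

To apply it to the real script, read `crawai_pds_selenium.py`, pass its text through the function, and write the result back.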

## 6. Run the Crawler

Now let's run the crawler with the appropriate options.

In [None]:
# Run the crawler
!python crawai_pds_selenium.py --shop-list-json shop_list.json --output-json $output_filename --headless
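
Once the run finishes, a quick look at the output file shows whether the crawler wrote valid JSON. The sketch below makes no assumption about the structure of the results beyond their being JSON; the helper name `summarize_results` is hypothetical.

```python
import json
from pathlib import Path


def summarize_results(path):
    """Return a one-line summary of the JSON file at `path`."""
    file_path = Path(path)
    if not file_path.is_file():
        return f"{path}: not found (the crawler may have failed)"
    try:
        data = json.loads(file_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return f"{path}: invalid JSON ({exc.msg})"
    size = len(data) if isinstance(data, (list, dict)) else 1
    return f"{path}: valid JSON, top-level {type(data).__name__} with {size} item(s)"


# In the notebook, call: print(summarize_results(output_filename))
```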

## 7. Save Results to Google Drive (Optional)

You can save the results to Google Drive for persistence.

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')

# Create a directory in Google Drive if it doesn't exist
!mkdir -p "/content/drive/My Drive/TN_PDS_Crawler_Results"

# Copy the results to Google Drive
!cp $output_filename "/content/drive/My Drive/TN_PDS_Crawler_Results/"
print(f"Results saved to Google Drive at: /content/drive/My Drive/TN_PDS_Crawler_Results/{output_filename.split('/')[-1]}")
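
A failing `!cp` does not raise a Python exception, so the cell above prints its success message even when the copy did not happen. A Python-side copy avoids that and also sidesteps shell quoting of the space in "My Drive". The following is a sketch under assumptions: the helper name `copy_to_folder` and its `mount_root` parameter are hypothetical, and checking that the mount directory exists is only a rough proxy for Drive being mounted.

```python
import shutil
from pathlib import Path


def copy_to_folder(src, dest_dir, mount_root="/content/drive"):
    """Copy `src` into `dest_dir` and return the new path.

    Raises FileNotFoundError if `mount_root` does not exist, so a missing
    Drive mount is not silently replaced by a local folder.
    """
    if not Path(mount_root).is_dir():
        raise FileNotFoundError(f"{mount_root} does not exist; is Drive mounted?")
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(src, dest))


# In the notebook, call:
# copy_to_folder(output_filename, "/content/drive/My Drive/TN_PDS_Crawler_Results")
```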

## 8. Download Results

Download the results to your local machine.

In [None]:
# Download the results
from google.colab import files
files.download(output_filename)

## 9. Debug Information

If the crawler encounters issues, run this cell to list any screenshots and page-source files it saved and to display one of the screenshots.

In [None]:
# Check if screenshots were saved
!ls -la *.png

# Check if page source was saved
!ls -la *.html

# Display one of the screenshots (if available)
import glob
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

screenshot_files = glob.glob("*.png")
if screenshot_files:
    img = mpimg.imread(screenshot_files[0])
    plt.figure(figsize=(15, 10))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f"Screenshot: {screenshot_files[0]}")
    plt.show()
else:
    print("No screenshots found")
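
The cell above displays whichever screenshot `glob` happens to return first. When debugging a recent failure, the most recently modified file is usually the more useful one. A minimal sketch, with the hypothetical helper name `newest_file`:

```python
import glob
import os


def newest_file(pattern):
    """Return the most recently modified file matching `pattern`, or None."""
    matches = glob.glob(pattern)
    return max(matches, key=os.path.getmtime) if matches else None


print("Newest screenshot:", newest_file("*.png"))
print("Newest page source:", newest_file("*.html"))
```

The returned path can be passed to `mpimg.imread` in place of `screenshot_files[0]`.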

## 10. Schedule Regular Runs (Advanced)

Note: Google Colab has limitations on how long notebooks can run. For true scheduling, consider using GitHub Actions or a dedicated server.

However, you can use this cell to run the crawler multiple times with delays.

In [None]:
# This is a simple scheduler that runs the crawler several times with a delay between runs
# Note: Colab disconnects idle or long-running sessions, so this is not a true scheduling solution
# To use it, remove the ''' lines around the code below and run the cell
'''
import time
import datetime
import shutil

# How many times to run
runs = 3
# Hours between runs
hours_between = 1
# Google Drive folder created in step 7
drive_dir = "/content/drive/My Drive/TN_PDS_Crawler_Results/"

for i in range(runs):
    print(f"Run {i+1}/{runs} starting at {datetime.datetime.now()}")

    # Generate filename with current timestamp
    current_time = datetime.datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file = f"data/shop_status_results_{current_time}.json"

    # Run the crawler
    !python crawai_pds_selenium.py --shop-list-json shop_list.json --output-json $output_file --headless

    # Copy to Google Drive if mounted. A failing `!cp` does not raise a Python
    # exception, so use shutil.copy, which raises OSError on failure.
    try:
        shutil.copy(output_file, drive_dir)
        print("Saved to Google Drive")
    except OSError:
        print("Could not save to Google Drive - make sure it's mounted and the crawler produced output")

    if i < runs - 1:  # Don't sleep after the last run
        sleep_seconds = hours_between * 3600
        print(f"Sleeping for {hours_between} hours until next run...")
        time.sleep(sleep_seconds)
'''
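
The loop above can also be factored into a small reusable function, which keeps one failed run from stopping the remaining ones and lets the delay be replaced in tests. This is a sketch, not the notebook's scheduler: the name `run_repeatedly` and its parameters are hypothetical, and `task` stands in for whatever launches the crawler.

```python
import time


def run_repeatedly(task, runs, delay_seconds, sleep=time.sleep):
    """Call `task(i)` for i in range(runs), sleeping between calls.

    Returns a list with each call's result, or the exception it raised.
    """
    results = []
    for i in range(runs):
        try:
            results.append(task(i))
        except Exception as exc:  # record the failure and keep going
            results.append(exc)
        if i < runs - 1:  # no delay after the last run
            sleep(delay_seconds)
    return results


# Demonstration with a stand-in task and no real waiting
print(run_repeatedly(lambda i: f"run {i + 1} done", runs=3, delay_seconds=0))
```

In the notebook, `task` could be a function that calls the crawler through `subprocess.run` and returns its exit code, with `delay_seconds` set to `hours_between * 3600`.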