<a href="https://colab.research.google.com/github/monjurkuet/yt-crawler/blob/main/crawl_yt_videos.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Automated YouTube Data Ingestion Pipeline Setup & Execution

This notebook sets up a fresh Google Colab environment end to end: it clones the GitHub repository containing the modularized data ingestion code, installs its dependencies, loads credentials from Colab secrets and Google Drive, and runs the data ingestion process.

**Before Running:**
1.  **Check the GitHub repository:** `https://github.com/monjurkuet/yt-crawler.git` must contain the `googleapis` directory with the Python modules.
2.  **Configure Colab Userdata Secrets:** Make sure the following secrets are set in your Colab environment (under the 🔑 icon in the left panel):
    *   `API_KEY` (Your YouTube Data API key)
    *   `SSH_HOST` (Your SSH server host address)
    *   `DATABASE_NAME` (Your MySQL database name)
    *   `DATABASE_PASSWORD` (Your MySQL database password)
3.  **Ensure Google Drive Files are Present:**
    *   Your SSH private key file (e.g., `databasemart`) in `/content/drive/MyDrive/cloudaccess/`
    *   Your `videoids.txt` file in `/content/drive/MyDrive/cloudaccess/` (one video ID per line)

Once these prerequisites are met, run all cells in this notebook.
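
The checklist above can be verified before the main cell runs. What follows is a minimal sketch, not part of the repository: `missing_prerequisites` is a hypothetical helper that takes the secrets as a plain mapping (in Colab the values would come from `userdata.get`) together with the two Drive paths, and returns a list of human-readable problems.

```python
import os
from typing import Iterable, List, Mapping, Optional

# Secret names and Drive paths are taken from the checklist above.
REQUIRED_SECRETS = ("API_KEY", "SSH_HOST", "DATABASE_NAME", "DATABASE_PASSWORD")
REQUIRED_FILES = (
    "/content/drive/MyDrive/cloudaccess/databasemart",
    "/content/drive/MyDrive/cloudaccess/videoids.txt",
)


def missing_prerequisites(
    secrets: Mapping[str, Optional[str]],
    files: Iterable[str] = REQUIRED_FILES,
) -> List[str]:
    """Return a description of every unmet prerequisite (hypothetical helper)."""
    problems = []
    for name in REQUIRED_SECRETS:
        # Treat absent, None, and blank values the same way.
        if not (secrets.get(name) or "").strip():
            problems.append(f"secret not set: {name}")
    for path in files:
        if not os.path.isfile(path):
            problems.append(f"file not found: {path}")
    return problems
```

An empty list means the notebook is ready to run; otherwise, print the entries and fix them before continuing.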

In [1]:
import os
import sys
from google.colab import drive
from google.colab import userdata
import warnings

# --- Configuration Variables (Adjust as needed) ---
REPO_URL = 'https://github.com/monjurkuet/yt-crawler.git'
REPO_NAME = 'yt-crawler'
LOCAL_REPO_PATH = os.path.join('/content', REPO_NAME)
MODULES_DIR = os.path.join(LOCAL_REPO_PATH, 'googleapis')

SSH_PRIVATEKEY_DRIVE_PATH = '/content/drive/MyDrive/cloudaccess/databasemart'
VIDEO_IDS_DRIVE_PATH = '/content/drive/MyDrive/cloudaccess/videoids.txt'

# --- 0. Suppress Paramiko UserWarning ---
warnings.filterwarnings('ignore', category=UserWarning, module='paramiko')

print("--- Starting Automated Colab Setup and Execution ---")

# --- 1. Install Git (if not already present) ---
print("1. Checking for Git installation...")
!apt-get update > /dev/null
!apt-get install -y git > /dev/null
print("   Git installation checked/completed.")

# --- 2. Clone the GitHub Repository ---
print(f"2. Cloning GitHub repository: {REPO_URL} into {LOCAL_REPO_PATH}...")
if not os.path.exists(LOCAL_REPO_PATH):
    !git clone {REPO_URL} {LOCAL_REPO_PATH}
    print("   Repository cloned.")
else:
    print("   Repository already cloned. Skipping.")
    # Optionally, pull latest changes
    # %cd {LOCAL_REPO_PATH}
    # !git pull
    # %cd /

# --- 3. Change directory to the cloned repository ---
# pip receives an absolute requirements path below, so this only affects
# how relative paths resolve for the rest of the run.
os.chdir(LOCAL_REPO_PATH)
print(f"3. Changed current working directory to: {os.getcwd()}")

# --- 4. Install Dependencies from requirements.txt ---
print(f"4. Installing Python packages from {MODULES_DIR}/requirements.txt...")
!pip install -r {MODULES_DIR}/requirements.txt
print("   Dependencies installed.")

# --- 5. Mount Google Drive and Set Permissions ---
print("5. Mounting Google Drive...")
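# Check for MyDrive rather than the mount point: an empty /content/drive can exist without a mount.
if not os.path.isdir('/content/drive/MyDrive'):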
    drive.mount('/content/drive')
    print("   Google Drive mounted.")
else:
    print("   Google Drive already mounted.")

print(f"   Setting permissions for SSH private key: {SSH_PRIVATEKEY_DRIVE_PATH}...")
if os.path.exists(SSH_PRIVATEKEY_DRIVE_PATH):
    !chmod 600 {SSH_PRIVATEKEY_DRIVE_PATH}
    print("   SSH private key permissions set.")
else:
    print(f"   WARNING: SSH private key not found at {SSH_PRIVATEKEY_DRIVE_PATH}. Data ingestion might fail.")

# --- 6. Export Colab Userdata as Environment Variables ---
print("6. Setting environment variables from Colab userdata...")
# These variables are read by config_manager.py, prioritizing os.environ
os.environ['API_KEY'] = userdata.get('API_KEY')
os.environ['SSH_HOST'] = userdata.get('SSH_HOST')
os.environ['SSH_USERNAME'] = 'administrator' # Fixed username
os.environ['SSH_PRIVATEKEY_PATH'] = SSH_PRIVATEKEY_DRIVE_PATH # Use the path confirmed above
os.environ['DATABASE_NAME'] = userdata.get('DATABASE_NAME')
os.environ['DATABASE_PASSWORD'] = userdata.get('DATABASE_PASSWORD')

# Add other fixed parameters that config_manager.py might expect as environment variables
os.environ['LOCAL_PORT'] = '3307'
os.environ['REMOTE_MYSQL_HOST'] = '127.0.0.1'
os.environ['REMOTE_MYSQL_PORT'] = '3306'

print("   Environment variables set.")

# --- 7. Add googleapis directory to Python path ---
print(f"7. Adding {MODULES_DIR} to Python's system path...")
if MODULES_DIR not in sys.path:
    sys.path.insert(0, MODULES_DIR)
    print(f"   Added {MODULES_DIR} to sys.path.")
else:
    print(f"   {MODULES_DIR} already in sys.path. Skipping.")

# --- 8. Execute main.py from the cloned repository ---
print("\n--- 8. Starting data ingestion process ---")

try:
    from main import DataIngestor # Import from the googleapis directory due to sys.path adjustment
    ingestor = DataIngestor()
    ingestor.ingest_data()
    print("\n--- Data ingestion process finished successfully. ---")
except ImportError as e:
    # ImportError also covers ModuleNotFoundError and a main.py that lacks DataIngestor.
    print(f"\nERROR: Could not import DataIngestor. Check sys.path and module names. Details: {e}")
    print("Please ensure the 'googleapis' directory contains main.py and is correctly added to sys.path.")
except Exception as e:
    print(f"\nCRITICAL ERROR during data ingestion: {e}")
    import traceback
    traceback.print_exc()

print("--- Automated Colab Setup and Execution Complete ---")
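
Step 6 of the cell above assigns each secret straight into `os.environ`, so a missing or blank secret surfaces as an exception partway through the list, or as a failure much later during ingestion. A minimal sketch of a fail-fast alternative follows. `export_secrets` and its `get_secret` parameter are hypothetical names; in Colab, `get_secret` would be `userdata.get`.

```python
from typing import Callable, Iterable, MutableMapping, Optional


def export_secrets(
    get_secret: Callable[[str], Optional[str]],
    names: Iterable[str],
    env: MutableMapping[str, str],
) -> None:
    """Copy secrets into env, or raise one error naming every missing secret."""
    values = {}
    missing = []
    for name in names:
        try:
            value = get_secret(name)
        except Exception:  # the getter may raise for unknown or inaccessible secrets
            value = None
        if value:
            values[name] = value
        else:
            missing.append(name)
    if missing:
        # Nothing is written to env unless every secret was found.
        raise RuntimeError("Missing Colab secrets: " + ", ".join(missing))
    env.update(values)
```

In the notebook this could be called as `export_secrets(userdata.get, ['API_KEY', 'SSH_HOST', 'DATABASE_NAME', 'DATABASE_PASSWORD'], os.environ)`, which reports all missing secrets at once instead of stopping at the first.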


--- Starting Automated Colab Setup and Execution ---
1. Checking for Git installation...
W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
   Git installation checked/completed.
2. Cloning GitHub repository: https://github.com/monjurkuet/yt-crawler.git into /content/yt-crawler...
Cloning into '/content/yt-crawler'...
remote: Enumerating objects: 150, done.
remote: Counting objects: 100% (150/150), done.
remote: Compressing objects: 100% (117/117), done.
remote: Total 150 (delta 55), reused 117 (delta 26), pack-reused 0 (from 0)
Receiving objects: 100% (150/150), 177.14 KiB | 2.21 MiB/s, done.
Resolving deltas: 100% (55/55), done.
   Repository cloned.
3. Changed current working directory to: /content/yt-crawler
4. Installing Python packages from /content/yt-crawler/googleapis/requirements.txt...
Collecting sshtunnel==0.4.0 (from -r /cont

INFO:video_processor:Successfully read 6 video IDs from /content/drive/MyDrive/cloudaccess/videoids.txt


2025-12-02 01:39:18,508 - INFO - Attempting to establish database connection...


INFO:video_processor:Attempting to establish database connection...
2025-12-02 01:39:18,510 - db_connector - INFO - Attempting to establish SSH tunnel...
INFO:db_connector:Attempting to establish SSH tunnel...
2025-12-02 01:39:21,150 - db_connector - INFO - SSH tunnel established on local port 3307
INFO:db_connector:SSH tunnel established on local port 3307
2025-12-02 01:39:21,152 - db_connector - INFO - Attempting to connect to MySQL database...
INFO:db_connector:Attempting to connect to MySQL database...
2025-12-02 01:39:23,599 - db_connector - INFO - Successfully connected to MySQL database.
INFO:db_connector:Successfully connected to MySQL database.


2025-12-02 01:39:23,602 - INFO - Ensuring database table exists...


INFO:video_processor:Ensuring database table exists...
2025-12-02 01:39:23,987 - db_connector - INFO - Executing CREATE TABLE statement...
INFO:db_connector:Executing CREATE TABLE statement...
2025-12-02 01:39:24,388 - db_connector - INFO - Table 'youtube_videos' ensured.
INFO:db_connector:Table 'youtube_videos' ensured.


2025-12-02 01:39:24,584 - INFO - Processing 6 video IDs...


INFO:video_processor:Processing 6 video IDs...


Processing Videos:   0%|          | 0/6 [00:00<?, ?it/s]

2025-12-02 01:39:25,906 - INFO - Successfully inserted video ID: Dmg-mPHLlRs


INFO:video_processor:Successfully inserted video ID: Dmg-mPHLlRs


2025-12-02 01:39:26,722 - INFO - Successfully inserted video ID: 5Tb9c7Cl1mM


INFO:video_processor:Successfully inserted video ID: 5Tb9c7Cl1mM


2025-12-02 01:39:27,535 - INFO - Successfully inserted video ID: DlzpAHwVi2g


INFO:video_processor:Successfully inserted video ID: DlzpAHwVi2g


2025-12-02 01:39:28,342 - INFO - Successfully inserted video ID: zpQAAPbcx_I


INFO:video_processor:Successfully inserted video ID: zpQAAPbcx_I


2025-12-02 01:39:29,150 - INFO - Successfully inserted video ID: 898BXxiXPbA


INFO:video_processor:Successfully inserted video ID: 898BXxiXPbA


2025-12-02 01:39:29,960 - INFO - Successfully inserted video ID: v8mGNw3WWLw


INFO:video_processor:Successfully inserted video ID: v8mGNw3WWLw
2025-12-02 01:39:30,156 - db_connector - INFO - MySQL connection closed.
INFO:db_connector:MySQL connection closed.
2025-12-02 01:39:30,177 - db_connector - INFO - SSH tunnel stopped.
INFO:db_connector:SSH tunnel stopped.


2025-12-02 01:39:30,179 - INFO - 
--- Processing Summary ---


INFO:video_processor:
--- Processing Summary ---


2025-12-02 01:39:30,181 - INFO - Total video IDs from file: 6


INFO:video_processor:Total video IDs from file: 6


2025-12-02 01:39:30,182 - INFO - Total processed attempts: 6


INFO:video_processor:Total processed attempts: 6


2025-12-02 01:39:30,186 - INFO - Successfully inserted (or already existed): 6


INFO:video_processor:Successfully inserted (or already existed): 6


2025-12-02 01:39:30,187 - INFO - Errors encountered: 0


INFO:video_processor:Errors encountered: 0


2025-12-02 01:39:30,189 - INFO - Check 'inserted_video_ids.log' for successful insertions.


INFO:video_processor:Check 'inserted_video_ids.log' for successful insertions.


2025-12-02 01:39:30,190 - INFO - Check 'error_log.log' for detailed error information.


INFO:video_processor:Check 'error_log.log' for detailed error information.



--- Data ingestion process finished successfully. ---
--- Automated Colab Setup and Execution Complete ---
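
The run above reports reading 6 video IDs from `videoids.txt`, which holds one ID per line. The repository's own reader is not shown in this notebook. As a sketch under that one-ID-per-line assumption, a hypothetical `parse_video_ids` helper could strip whitespace, skip blank lines, and drop duplicates while keeping the original order.

```python
from typing import Iterable, List


def parse_video_ids(lines: Iterable[str]) -> List[str]:
    """Return unique, non-empty video IDs in first-seen order (hypothetical helper)."""
    seen = set()
    ids = []
    for line in lines:
        video_id = line.strip()
        if not video_id or video_id in seen:
            continue
        seen.add(video_id)
        ids.append(video_id)
    return ids


sample = ["Dmg-mPHLlRs\n", "\n", "5Tb9c7Cl1mM\n", "Dmg-mPHLlRs\n"]
print(parse_video_ids(sample))  # ['Dmg-mPHLlRs', '5Tb9c7Cl1mM']
```

Because the function accepts any iterable of lines, an open file handle for `VIDEO_IDS_DRIVE_PATH` can be passed to it directly.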
