# 01: Setup & Data Collection

This notebook handles:
- **Step 0**: Environment setup, API configuration, and folder structure
- **Step 1**: Scraping Reddit posts from r/BrawlStars and downloading media locally

**Output**: `data/raw_data.csv` with post metadata and local media paths

---

## Step 0: Preparation

First, we set up our environment. This involves installing all necessary libraries, setting up our API keys, and creating the directories where we'll store our data and media.

In [None]:
# Install required packages (run this in your terminal/virtual environment first)
# pip install praw pmaw pandas requests google-generativeai scikit-learn transformers torch torchvision opencv-python-headless tqdm seaborn matplotlib

### Import Libraries

In [None]:
# --- Imports ---
import praw                     # For Reddit API access
import pandas as pd             # For data manipulation
import requests                 # For downloading files
import os                       # For file/directory operations
from tqdm.auto import tqdm      # For progress bars
from datetime import datetime   # For date handling


print("✅ All libraries imported successfully!")

### API Keys & Configuration

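⚠️ **IMPORTANT**: The next cell reads your API keys from environment variables. Set `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, and `GEMINI_API_KEY` before running it, and replace `YOUR_USERNAME` in the user agent string.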

**Best Practice**: Store these in a `.env` file and use `python-dotenv` to load them.

In [None]:
# --- API Keys & Config ---
# !! IMPORTANT: Keys are read from environment variables; do not hard-code them here.
# Get Reddit API credentials at: https://www.reddit.com/prefs/apps
REDDIT_CLIENT_ID = os.getenv("REDDIT_CLIENT_ID")
REDDIT_CLIENT_SECRET = os.getenv("REDDIT_CLIENT_SECRET")
REDDIT_USER_AGENT = "BrawlStars Sentiment Scraper v3.0 by /u/YOUR_USERNAME"  # Replace YOUR_USERNAME

GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

print("✅ API keys configured (make sure the environment variables are set!)")
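
Because `os.getenv` returns `None` for an unset variable, a missing key only surfaces later as an authentication error. The following is a minimal sketch of a fail-fast check, not part of the original notebook; `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced here. If you use `python-dotenv`, calling `load_dotenv()` before this check should populate the environment from your `.env` file.

```python
import os

# Variable names used by the configuration cell above
REQUIRED_VARS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "GEMINI_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

The optional `env` argument exists only so the helper can be exercised with a plain dictionary instead of the real environment.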

### Project Constants & Directory Setup

In [None]:
# --- Project Constants ---
SUBREDDIT_NAME = "Brawlstars"
POST_LIMIT = 1200  # Request 1200 to aim for ~1000 good posts (Reddit listings usually stop near 1000 items)

# --- File & Directory Setup ---
MEDIA_DIR = "media"
IMAGE_DIR = os.path.join(MEDIA_DIR, "images")
VIDEO_DIR = os.path.join(MEDIA_DIR, "videos")
DATA_DIR = "data"

# Create directories if they don't exist
os.makedirs(IMAGE_DIR, exist_ok=True)
os.makedirs(VIDEO_DIR, exist_ok=True)
os.makedirs(DATA_DIR, exist_ok=True)

# --- File Paths ---
RAW_DATA_CSV = os.path.join(DATA_DIR, 'raw_data.csv')

print(f"✅ Directory structure created:")
print(f"   📁 {IMAGE_DIR}")
print(f"   📁 {VIDEO_DIR}")
print(f"   📁 {DATA_DIR}")
print(f"\n📄 Output will be saved to: {RAW_DATA_CSV}")

---

## Step 1: Data Collection (Scraping and Downloading)

Here, we connect to the Reddit API using PRAW, scrape the current hot posts from r/BrawlStars, and, most importantly, download the associated image or video for each post. We save the *local path* to this media in our DataFrame.

### Initialize Reddit API Client

In [None]:
# Initialize PRAW (Reddit API client)
reddit = praw.Reddit(
    client_id=REDDIT_CLIENT_ID,
    client_secret=REDDIT_CLIENT_SECRET,
    user_agent=REDDIT_USER_AGENT,
)

# Verify connection
try:
    # This should return None for read-only (script) authentication
    user = reddit.user.me()
    print(f"✅ Connected to Reddit API")
    print(f"   User: {user if user else 'Read-only access (script mode)'}")
except Exception as e:
    print(f"❌ Error connecting to Reddit: {e}")
    print("   Please check your API credentials!")

### Define Media Download Function

In [None]:
def is_gallery_post(post):
    """
    Check if a post is a gallery post (multiple images).
    
    Args:
        post: A PRAW submission object
        
    Returns:
        bool: True if post is a gallery, False otherwise
    """
    # Gallery posts have gallery_data or media_metadata attributes
    return hasattr(post, 'gallery_data') or hasattr(post, 'media_metadata')


def download_media(post):
    """
    Downloads the media (image or video) for a PRAW post and returns the local file path.
    
    Args:
        post: A PRAW submission object
        
    Returns:
        str: Local file path if media was downloaded, None otherwise
    """
    # Skip gallery posts
    if is_gallery_post(post):
        return None
    
    post_hint = getattr(post, 'post_hint', None)
    media_url = None
    local_path = None
    file_ext = ".unknown"

    try:
        if post_hint == 'image':
            media_url = post.url
            file_ext = os.path.splitext(media_url)[1]
            if not file_ext:
                file_ext = ".jpg"  # Default for images without clear extension
            local_path = os.path.join(IMAGE_DIR, f"{post.id}{file_ext}")

        elif post_hint == 'hosted:video':
            media_url = post.media['reddit_video']['fallback_url']
            file_ext = ".mp4"
            local_path = os.path.join(VIDEO_DIR, f"{post.id}{file_ext}")
        
        elif post_hint == 'rich:video':
            # These are often YouTube links, etc. We'll skip downloading them for now.
            # You could use yt-dlp if you want to handle these.
            pass

        # If we have a URL and a path, download the file
        if media_url and local_path:
            if os.path.exists(local_path):
                return local_path  # Already downloaded

            response = requests.get(media_url, stream=True, timeout=30)  # Timeout avoids hanging on a stalled download
            response.raise_for_status()  # Raise an exception for bad status codes
            
            with open(local_path, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
            return local_path

    except Exception as e:
        # Silently fail for individual posts
        return None
    
    return None  # No downloadable media

print("✅ Media download function defined")
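
One edge case in `download_media` is worth noting: `os.path.splitext` is applied to the full URL, so any query string would end up inside the file extension. The sketch below shows one way to avoid that by parsing the URL first. `extension_from_url` is a hypothetical helper, not part of the notebook, and lowercasing the extension is an added assumption rather than the original behavior.

```python
import os
from urllib.parse import urlparse


def extension_from_url(url, default=".jpg"):
    """Return the extension of the URL's path, ignoring query string and fragment."""
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower()
    return ext if ext else default


print(extension_from_url("https://i.redd.it/abc123.png"))
print(extension_from_url("https://example.com/pic.jpeg?width=640&s=xyz"))
print(extension_from_url("https://example.com/no-extension"))
```

The three calls print `.png`, `.jpeg`, and `.jpg` (the default), respectively. The example URLs are made up for illustration.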

### Scrape Posts and Download Media

This cell will:
1. Fetch posts from r/BrawlStars
2. Download any associated media (images/videos)
3. Store all data in a DataFrame
4. Save to CSV

⏱️ **This may take 10-20 minutes depending on your connection speed**

In [None]:
print("="*70)
print("🚀 STARTING DATA COLLECTION")
print("="*70)

# --- 1. Load Existing Data (if any) ---
try:
    df_existing = pd.read_csv(RAW_DATA_CSV)
    already_scraped_ids = set(df_existing['id'])
    print(f"\n✅ Loaded {len(df_existing)} previously scraped posts")
except FileNotFoundError:
    df_existing = pd.DataFrame()
    already_scraped_ids = set()
    print(f"\n📝 No existing raw data found. Starting from scratch.")

# --- 2. Initialize Scraping ---
print(f"\n📊 Scraping configuration:")
print(f"   Target posts to fetch: {POST_LIMIT}")
print(f"   Already in dataset:    {len(already_scraped_ids)}")
print(f"   Gallery posts will be skipped")
print(f"\n⏹️  Press 'Kernel → Interrupt' to stop at any time (progress is saved)\n")

all_posts_data = []
gallery_posts_skipped = 0
duplicate_posts_skipped = 0
subreddit = reddit.subreddit(SUBREDDIT_NAME)

# --- 3. Scrape Posts (Interruptible Loop) ---
try:
    print("="*70)
    print(f"🔍 Fetching posts from r/{SUBREDDIT_NAME}...\n")
    
    # Use tqdm for a progress bar
    for post in tqdm(subreddit.hot(limit=POST_LIMIT), total=POST_LIMIT, desc="Scraping posts"):
        try:
            # Check if we already have this post
            if post.id in already_scraped_ids:
                duplicate_posts_skipped += 1
                continue  # Skip posts we already have
            
            # Check if it's a gallery post and skip it
            if is_gallery_post(post):
                gallery_posts_skipped += 1
                continue  # Skip gallery posts
            
            # 1. Download media and get local path
            local_media_path = download_media(post)
            
            # 2. Store all relevant data
            post_data = {
                'id': post.id,
                'title': post.title,
                'text': post.selftext,
                'url': post.url,
                'permalink': post.permalink,
                'score': post.score,
                'created_utc': post.created_utc,
                'post_hint': getattr(post, 'post_hint', 'text_only'),
                'local_media_path': local_media_path
            }
            all_posts_data.append(post_data)
            already_scraped_ids.add(post.id)  # Mark as scraped
            
        except Exception as e:
            print(f"\n⚠️  Error processing post: {e}")

except KeyboardInterrupt:
    print("\n\n" + "="*70)
    print("⏹️  INTERRUPTED BY USER")
    print("="*70)
    print("Scraping stopped. Will save all posts collected so far...\n")

finally:
    # --- 4. Save Results (even if interrupted) ---
    
    if len(all_posts_data) == 0:
        print("\n⚠️  No new posts were scraped in this session.")
        if len(df_existing) > 0:
            print(f"   Existing dataset: {len(df_existing)} posts")
            df_raw = df_existing  # Use existing data for next cell
    else:
        print(f"\n📊 Processing {len(all_posts_data)} newly scraped posts...")
        
        # Convert new posts to DataFrame
        df_new = pd.DataFrame(all_posts_data)
        
        # Combine with existing data
        if len(df_existing) > 0:
            df_combined = pd.concat([df_existing, df_new], ignore_index=True)
        else:
            df_combined = df_new
        
        # Remove any duplicates (just in case)
        df_combined = df_combined.drop_duplicates(subset=['id'], keep='first')
        
        # Save to CSV
        df_combined.to_csv(RAW_DATA_CSV, index=False)
        
        print(f"\n💾 Dataset updated!")
        print(f"   New posts added:        {len(all_posts_data)}")
        print(f"   Total posts in dataset: {len(df_combined)}")
        print(f"   Saved to: {RAW_DATA_CSV}")
        
        print(f"\n📋 Session statistics:")
        print(f"   🚫 Gallery posts skipped:    {gallery_posts_skipped}")
        print(f"   ♻️  Duplicate posts skipped:  {duplicate_posts_skipped}")
        print(f"   ✅ New posts added:          {len(all_posts_data)}")
        
        # Store the combined dataframe for the next cell
        df_raw = df_combined


### Data Summary & Statistics

In [None]:
# Make sure df_raw is loaded
if 'df_raw' not in locals() or df_raw is None:
    try:
        df_raw = pd.read_csv(RAW_DATA_CSV)
    except FileNotFoundError:
        print("❌ No data found. Please run the scraping cell first.")
        df_raw = pd.DataFrame()

if len(df_raw) == 0:
    print("❌ Dataset is empty. Please run the scraping cell first.")
else:

    print("\n" + "="*60)
    print("📊 DATA COLLECTION SUMMARY")
    print("="*60)
    
    print(f"\n🚫 Gallery posts skipped:     {gallery_posts_skipped}")
    print(f"   (Gallery posts contain multiple images and are excluded from analysis)")
    
    print("\n--- Sample of Collected Data ---")
    display(df_raw.head())
    
    print("\n--- Media Type Breakdown ---")
    media_counts = df_raw['post_hint'].value_counts()
    print(media_counts)
    
    print("\n--- Downloaded Media Statistics ---")
    total_posts = len(df_raw)
    posts_with_media = df_raw['local_media_path'].notna().sum()
    posts_without_media = total_posts - posts_with_media
    
    print(f"Total posts scraped:        {total_posts}")
    print(f"Posts with local media:     {posts_with_media} ({posts_with_media/total_posts*100:.1f}%)")
    print(f"Posts without media:        {posts_without_media} ({posts_without_media/total_posts*100:.1f}%)")
    
    # Count by media type
    images_count = len([p for p in df_raw['local_media_path'] if pd.notna(p) and 'images' in str(p)])
    videos_count = len([p for p in df_raw['local_media_path'] if pd.notna(p) and 'videos' in str(p)])
    
    print(f"\n  Images downloaded:        {images_count}")
    print(f"  Videos downloaded:        {videos_count}")
    
    print("\n" + "="*60)
    print("✅ Step 1 Complete! You can now proceed to Step 2 (Labeling)")
    print("="*60)

---

## 🔄 Resumability & Expandability Features

This notebook is designed to be **interruptible, resumable, and expandable**:

### ✅ Interruptible
- Press `Kernel → Interrupt` or the stop button (⏹️) at any time
- All progress is saved automatically in the `finally` block
- No data loss even if interrupted
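
The pattern behind this behavior can be sketched in isolation. This is a simplified illustration under stated assumptions, not the notebook's scraping cell: `collect_until_interrupted` and `interrupting_source` are hypothetical names, and the generator simulates a user interrupt after three items.

```python
def collect_until_interrupted(items, save):
    """Collect items one by one; always hand whatever was collected to save()."""
    collected = []
    try:
        for item in items:
            collected.append(item)
    except KeyboardInterrupt:
        print("Interrupted; saving partial results")
    finally:
        # Runs whether the loop finished normally or was interrupted
        save(collected)
    return collected


def interrupting_source():
    """Hypothetical stand-in for a Reddit listing: yields 3 items, then interrupts."""
    yield from ("post_a", "post_b", "post_c")
    raise KeyboardInterrupt


saved = []
collect_until_interrupted(interrupting_source(), saved.extend)
print(saved)
```

Even though the source raises `KeyboardInterrupt`, the three items collected before the interrupt reach `save`.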

### ✅ Resumable
- Re-run the scraping cell anytime to continue
- Automatically loads existing `raw_data.csv`
- Skips posts that are already in the dataset (checks by `id`)
- Only adds new posts

### ✅ Expandable
- Run again tomorrow/next week to add more posts
- Appends to existing dataset instead of overwriting
- Perfect for building datasets over time
- Handles duplicates automatically

### 📊 Example Usage Scenarios:

**Scenario 1: Initial Collection**
```
Day 1: Scrape 1,200 posts → 1,000 added (200 were galleries)
Result: raw_data.csv has 1,000 posts
```

**Scenario 2: Interrupted & Resumed**
```
Day 1: Start scraping 1,200 posts
       After 500 posts → Keyboard interrupt!
Result: raw_data.csv has 500 posts (saved)

Day 1: Re-run the cell
       Loads 500 existing posts
       Skips first 500 duplicates
       Adds next 700 posts
Result: raw_data.csv has 1,200 posts total
```

**Scenario 3: Expanding Dataset Over Time**
```
Day 1: Scrape hot posts → 1,000 posts added
Day 2: Scrape hot posts again → 200 new posts added (800 were duplicates)
Day 7: Scrape hot posts again → 150 new posts added
Result: raw_data.csv grows from 1,000 → 1,200 → 1,350 posts
```
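
The arithmetic in Scenario 3 follows from the ID check alone. The sketch below replays it with made-up integer IDs in place of real Reddit post IDs; `add_new_posts` is a hypothetical helper that mirrors the skip-if-known logic, not the notebook's actual loop.

```python
def add_new_posts(known_ids, fetched_ids):
    """Add unseen IDs to known_ids; return (added, duplicates_skipped)."""
    added = 0
    duplicates = 0
    for post_id in fetched_ids:
        if post_id in known_ids:
            duplicates += 1
            continue
        known_ids.add(post_id)
        added += 1
    return added, duplicates


known = set()
# Day 1: 1,000 posts, none seen before
print(add_new_posts(known, range(0, 1000)), len(known))
# Day 2: 1,000 fetched, 800 of them overlap with Day 1
print(add_new_posts(known, range(200, 1200)), len(known))
# Day 7: 1,000 fetched, only 150 of them are new
print(add_new_posts(known, range(350, 1350)), len(known))
```

The dataset size after each run is 1,000, then 1,200, then 1,350, matching the scenario.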

### 🎯 Pro Tips:

1. **For maximum data collection**: Run this weekly to capture different "hot" posts
2. **For interruption recovery**: Just re-run the cell; posts already in the dataset are skipped by `id`
3. **For dataset growth**: Change `subreddit.hot()` to `subreddit.new()` or `subreddit.top(time_filter='month')` to get different posts
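
Switching between listings, as the last tip suggests, can be wrapped in a small helper. This is a sketch under assumptions: `fetch_listing` is a hypothetical function, and `FakeSubreddit` is a stand-in that imitates only the `hot`, `new`, and `top` methods of a PRAW subreddit so the example runs without credentials.

```python
def fetch_listing(subreddit, mode="hot", limit=100, time_filter="month"):
    """Return a listing iterator from a PRAW-style subreddit object.

    mode is one of "hot", "new", or "top"; time_filter only applies to "top".
    """
    if mode == "hot":
        return subreddit.hot(limit=limit)
    if mode == "new":
        return subreddit.new(limit=limit)
    if mode == "top":
        return subreddit.top(time_filter=time_filter, limit=limit)
    raise ValueError(f"Unsupported mode: {mode!r}")


class FakeSubreddit:
    """Hypothetical stand-in so the sketch runs without Reddit access."""

    def hot(self, limit):
        return iter([f"hot_{i}" for i in range(limit)])

    def new(self, limit):
        return iter([f"new_{i}" for i in range(limit)])

    def top(self, time_filter, limit):
        return iter([f"top_{time_filter}_{i}" for i in range(limit)])


print(list(fetch_listing(FakeSubreddit(), mode="top", limit=2)))
```

With a real client, the scraping loop could iterate over `fetch_listing(subreddit, mode="new", limit=POST_LIMIT)` in place of `subreddit.hot(limit=POST_LIMIT)`.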

---

## ✅ Checkpoint

**What we accomplished:**
- ✅ Set up project structure and directories
- ✅ Connected to Reddit API
- ✅ Scraped 1,000+ posts from r/BrawlStars
- ✅ Downloaded images and videos locally
- ✅ Saved data to `data/raw_data.csv`

**Next step:**
- 📝 **Notebook 02**: AI-Powered Labeling with Gemini API

**Files created:**
- `data/raw_data.csv` - Contains post metadata and local media paths
- `media/images/*` - Downloaded images
- `media/videos/*` - Downloaded videos

---