ig-scraper

Instagram scraping tool for collecting account data, posts, and comments for analysis using the Instaloader library.

Features

Profile scraping: Fetch user profiles with follower counts, bio, verification status
Post collection: Download media, captions, and metadata for posts
Comment extraction: Full cursor-based pagination for comment threads
Media downloads: Photos, videos, albums (carousels), and reels via Instaloader
Analysis generation: Automated markdown analysis of account patterns (hooks, formats, CTAs, top posts)
Type-safe models: Dataclass-based data models for Profile, Post, PostResource, and Comment
Retry logic: Exponential backoff for transient API failures
Structured logging: Dual-output (console INFO + file DEBUG) with timestamped log files

Requirements

Python 3.12+
Instagram account credentials or a valid browser-exported Instagram session cookie

Installation

# Clone the repository
git clone https://github.com/<your-repo>/ig-scraper
cd ig-scraper

# Install dependencies (requires uv)
uv sync

# Install dev dependencies
uv sync --group dev

Configuration

Create a .env file in the project root. The current implementation supports either a cookie-backed session bootstrap or a username/password login:

INSTAGRAM_SESSIONID=your_session_id_here
# OR
INSTAGRAM_USERNAME=your_username
INSTAGRAM_PASSWORD=your_password

When INSTAGRAM_SESSIONID is set, the scraper currently expects browser-exported cookies in cookies.txt at the repo root and uses them to seed an authenticated Instaloader session. When username/password credentials are provided, Instaloader logs in directly and saves a reusable session file under its standard session-file location.

Exporting Browser Cookies

Log into Instagram in your browser
Export the Instagram cookies to a cookies.txt file at the repo root
Set INSTAGRAM_SESSIONID in .env so the scraper enables cookie-backed session auth

Optional Tuning Variables

Variable	Default	Description
`IG_REQUEST_TIMEOUT_SECONDS`	`30`	Timeout for Instagram API requests
`IG_COMMENTS_PAGE_SIZE`	`250`	Number of comments fetched per pagination page
`IG_REQUEST_PAUSE_SECONDS`	`0.25`	Delay between requests to avoid rate limiting
`IG_COMMENT_PAGE_RETRIES`	`3`	Retry attempts for comment pagination failures
`IG_MEDIA_DOWNLOAD_RETRIES`	`3`	Retry attempts for media download failures

Usage

CLI

# Scrape a single handle
uv run python -m ig_scraper --handles @username

# Scrape multiple handles
uv run python -m ig_scraper --handles @user1,@user2,@user3

# Scrape all handles from resources/instagram_handles.md
uv run python -m ig_scraper --all

# Limit posts per handle (default: 100)
uv run python -m ig_scraper --handles @username --max-posts-per-handle 50

Note: Either --handles or --all is required. The tool authenticates once per run and reuses the client for that run.

Programmatic

from ig_scraper import fetch_profile_posts_and_comments, get_instaloader_client

client = get_instaloader_client()
profile, posts, comments = fetch_profile_posts_and_comments("username", posts_per_profile=50)

Output Structure

All output goes under data/accounts/@<handle>/:

data/accounts/@username/
├── analysis.md          # Generated account analysis report
├── raw-posts.json       # All posts as JSON array
├── raw-comments.json    # All comments as JSON array
├── posts/
│   └── 001_AbCdEfGh/    # One folder per post (index_shortcode)
│       ├── metadata.json
│       ├── comments.json
│       ├── caption.txt
│       └── media/       # Downloaded media files
└── swipes/
    └── post-01.md       # Swipe-worthy post summaries

Analysis Report (`analysis.md`)

Generated markdown covering:

Account Profile — bio, follower/following counts, format mix
Pattern Observations — hook patterns, caption themes, hashtag/mention frequency, CTA patterns, comment analysis
Swipe-Worthy Posts — top 5 posts by engagement with hook + caption summary
What Doesn't Work — access gaps and data limitations
Strategy Implications — dominant themes and format mix
Access Log — scraping method and constraint notes

Post JSON Schema

{
  "id": "string",
  "pk": "string",
  "short_code": "string",
  "url": "https://www.instagram.com/p/AbCdEfGh/",
  "type": "reel/video | carousel | ... ",
  "caption": "string",
  "comment_count": 0,
  "like_count": 0,
  "taken_at": "ISO8601",
  "owner_username": "string",
  "owner_full_name": "string",
  "owner_id": "string",
  "video_url": "string",
  "thumbnail_url": "string",
  "is_video": false,
  "mentions": ["@user"],
  "hashtags": ["#tag"],
  "resources": [{ "pk", "media_type", "thumbnail_url", "video_url" }],
  "media_files": ["file.jpg"],
  "post_folder": "001_AbCdEfGh",
  "from_url": "https://www.instagram.com/username/",
  "_profile": { ... }
}

Comment JSON Schema

{
  "post_url": "https://www.instagram.com/p/AbCdEfGh/",
  "comment_url": "https://www.instagram.com/p/AbCdEfGh/#comment-123",
  "id": "123456789",
  "text": "Great post!",
  "owner_username": "commenter",
  "owner_full_name": "Commenter Name",
  "owner_profile_pic_url": "https://...",
  "timestamp": "ISO8601",
  "likes_count": 0,
  "replies_count": 0
}

Project Structure

ig-scraper/
├── src/ig_scraper/
│   ├── __init__.py              # Public API: exports models + core functions
│   ├── __main__.py              # python -m ig_scraper entry point
│   ├── analysis.py              # Caption/post analysis helpers (hooks, hashtags, top_words)
│   ├── analysis_io.py           # I/O utilities + analysis constants (CTA_TOKENS, HOOK_WORDS)
│   ├── analysis_render.py       # Markdown report builder (build_analysis_markdown)
│   ├── cli.py                   # argparse CLI, handle loop, status tracking
│   ├── client.py                # Instaloader auth bootstrap + account verification
│   ├── comments.py              # Cursor-based comment pagination (media_comments_chunk)
│   ├── config.py                # Environment variable overrides + rate-limit constants
│   ├── exceptions.py            # Exception hierarchy + classify_exception()
│   ├── logging_utils.py         # Dual-handler logging (console INFO, file DEBUG)
│   ├── media.py                 # Media download dispatch (album/photo/clip/video)
│   ├── media_processing.py      # Per-post pipeline: download → post dict → comments
│   ├── models/
│   │   ├── __init__.py
│   │   ├── comment.py           # Comment dataclass
│   │   ├── post.py              # Post + PostResource dataclasses
│   │   └── profile.py           # Profile dataclass
│   ├── paths.py                 # Centralized Path constants (ROOT, DATA_DIR, etc.)
│   ├── retry.py                 # Exponential backoff (retry_on decorator + _retry_with_backoff)
│   ├── run_scrape.py            # Per-handle orchestration: scrape → write artifacts → update README
│   └── scraper.py               # Profile + media fetch loop (fetch_profile_posts_and_comments)
├── tests/
│   ├── conftest.py              # Pytest fixtures
│   ├── factories.py             # Polyfactory data factories
│   ├── features/                # BDD .feature files
│   ├── test_*.py                # Unit/integration tests
│   └── test_regressions/        # Golden/regression test data
├── scripts/
│   ├── check_all.py             # Runs: ruff check, ruff format --check, mypy, pytest
│   └── check_file_length.py     # Enforces 200-line file limit
├── resources/
│   └── instagram_handles.md     # Handles list for --all (one @handle per line)
├── data/
│   ├── accounts/                # Scraped data (gitignored)
│   └── README.md                # Data corpus documentation
├── pyproject.toml               # Project metadata + tool configuration
├── uv.lock                      # Locked dependencies
└── README.md                    # This file

Development

Running Tests

# All tests
uv run pytest

# With coverage
uv run pytest --cov

# Specific file
uv run pytest tests/test_scraper.py

# Slow tests only
uv run pytest -m slow

# Skip integration tests
uv run pytest -m "not integration"

Code Quality

# Lint
uv run ruff check .

# Format check
uv run ruff format --check .

# Format + apply fixes
uv run ruff format .

# Type check (mypy)
uv run mypy src/

# All checks (ruff lint, ruff format check, mypy, pytest, file length)
uv run python scripts/check_all.py

# Pre-commit hooks (runs check_all.py on every commit)
uv run pre-commit run --all-files

Exception Handling Strategy

Exception	Classification	Behavior
`RuntimeError`	Retryable	Exponential backoff retry
`ConnectionError`	Retryable	Exponential backoff retry
`TimeoutError`	Retryable	Exponential backoff retry
`LoginRequired`, `AuthError`	Fatal	Auth error raised immediately
`IgScraperError` subclasses	Fatal	Logged and re-raised

Graceful degradation: media download failures don't stop the scrape; comment pagination failures don't prevent post metadata from being saved.

Architecture

Data Flow

cli.main()
  └─> load_handles()          # Read @handles from resources/instagram_handles.md
  └─> initialize_readme()     # Create/update data/README.md status table
  └─> for each handle:
        process_handle()
          ├─> clean_handle()              # Normalize @username → username
          ├─> fetch_profile_posts_and_comments()
          │     ├─> get_instaloader_client()      # Auth once per run (cookie or password)
          │     ├─> Profile.from_username()        # Fetch profile info
          │     ├─> profile.get_posts() → _take_n() # Fetch media list (patched doc_id)
          │     └─> for each media:
          │           _process_single_media()
          │             ├─> _download_media()      # instaloader download dispatch
          │             ├─> _build_post_dict()    # Post dataclass → dict
          │             └─> _fetch_all_comments()   # Paginated via patched get_comments()
          ├─> write_post_artifacts()     # metadata.json, comments.json, caption.txt, media/
          ├─> write_json(raw-posts.json, raw-comments.json)
          ├─> build_analysis_markdown()  # analysis.md report
          └─> update_readme_status()    # Mark handle as "analyzed" in data/README.md

Authentication

client.py loads credentials from .env and authenticates via instaloader. Cookie-backed sessions (from cookies.txt) are tried first, falling back to username/password login. Monkey-patches in patch.py fix broken upstream endpoints for get_posts() and get_comments().

Retry Mechanism

retry.py provides two APIs:

@retry_on(*exc_types, max_attempts, wait_base_seconds) — decorator for clean retry wrapping
_retry_with_backoff() — legacy function-based API

Wait formula: wait_seconds = REQUEST_PAUSE_SECONDS * (2 ** attempt)

Logging

logging_utils.py configures dual output on first logger access:

Console: INFO level, human-readable %(asctime)s | %(levelname)-7s | %(name)s | %(message)s
File: DEBUG level, timestamped at logs/YYYY-MM-DD_HH-MM-SS.log

All log lines use format_kv() to produce pipe-delimited key=value pairs for grep-friendly traceability.

Dependencies

Runtime

Package	Version	Purpose
instaloader	>=4.15.1	Instagram API client
python-dotenv	>=1.2.0	`.env` file loading
Pillow	>=8.1.1	Image processing
requests	>=2.33.0	HTTP library

Development

Package	Purpose
pytest + pytest-cov + pytest-mock + pytest-regressions + pytest-bdd	Testing
responses	HTTP mocking
mypy	Static type checking
ruff	Linting and formatting
ty	Fast type checker (Astral)
bandit	Security linting
invoke	Task automation
prospector	Comprehensive code analysis
hypothesis + polyfactory	Property-based testing + data factories
inline-snapshot	Snapshot testing
pip-audit	Dependency vulnerability scanning

Troubleshooting

Authentication Failed

AuthError: Instagram authentication failed: ...

Verify your INSTAGRAM_SESSIONID is current — Instagram sessions expire
Ensure the account isn't temporarily suspended or rate-limited
Check that the sessionid cookie is valid and not expired

Rate Limiting

Instagram aggressively throttles scraping. The tool handles this with:

Request pauses (0.25s default between operations)
Exponential backoff on RuntimeError / ConnectionError
Full cursor-based comment pagination

If you hit hard limits:

Wait 15–30 minutes before retrying
Lower --max-posts-per-handle
Use a different Instagram account's session

Partial Data

If some posts lack comments or media:

The tool logs a warning and continues — it never aborts a scrape for partial data
Brand/private/restricted posts may return sparse captions or no comments
Check data/accounts/@username/posts/<shortcode>/ for per-post metadata.json with _profile access metadata

Contributing

Fork and create a feature branch
Make changes with type annotations
Run uv run python scripts/check_all.py — all checks must pass
Commit with a clear message
Open a Pull Request

License

[Your License Here]

Name		Name	Last commit message	Last commit date
Latest commit History 42 Commits
.github		.github
data		data
docs		docs
resources		resources
scripts		scripts
src/ig_scraper		src/ig_scraper
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.python-version		.python-version
AGENTS.md		AGENTS.md
README.md		README.md
pyproject.toml		pyproject.toml
tasks.py		tasks.py
uv.lock		uv.lock

Folders and files

Latest commit

History

Repository files navigation

ig-scraper

Features

Requirements

Installation

Configuration

Exporting Browser Cookies

Optional Tuning Variables

Usage

CLI

Programmatic

Output Structure

Analysis Report (analysis.md)

Post JSON Schema

Comment JSON Schema

Project Structure

Development

Running Tests

Code Quality

Exception Handling Strategy

Architecture

Data Flow

Authentication

Retry Mechanism

Logging

Dependencies

Runtime

Development

Troubleshooting

Authentication Failed

Rate Limiting

Partial Data

Contributing

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Analysis Report (`analysis.md`)

Packages