# GitHub Issues Analysis with LangSmith Insights

This example demonstrates how to use LangSmith Insights to analyze GitHub issues from any repository. We'll:
1. Fetch all open issues from a GitHub repository
2. Store the issues in a CSV file for reference
3. Use LangSmith Insights to automatically cluster issues by theme (bugs, feature requests, documentation, etc.)

## Prerequisites

You will need:
1. A GitHub Personal Access Token (optional, but recommended because it raises your API rate limit)

<details>
<summary><b>How to get your GitHub Personal Access Token</b> (click to expand)</summary>

1. Go to [GitHub Settings → Developer settings → Personal access tokens → Tokens (classic)](https://github.com/settings/tokens)
2. Click "Generate new token" → "Generate new token (classic)"
3. Give your token a descriptive name (e.g., "LangSmith Insights")
4. Select scopes:
   - For public repositories: No scopes needed (you can still generate a token for higher rate limits)
   - For private repositories: Check `repo` scope
5. Click "Generate token" and copy it immediately (you won't be able to see it again)

**Note**: Without a token, you're limited to 60 requests/hour. With a token, you get 5,000 requests/hour.

</details>

2. A LangSmith API key

<details>
<summary><b>How to get your LangSmith API key</b> (click to expand)</summary>

1. Go to [LangSmith Settings](https://smith.langchain.com/settings)
2. Click "Create API Key"
3. Copy your new API key

</details>

## Setup

Before running the notebook, set your API keys as environment variables in your terminal, then launch Jupyter from that same shell so the notebook process inherits them:
```bash
export GITHUB_TOKEN=your-github-token-here  # Optional but recommended
export LANGSMITH_API_KEY=your-langsmith-api-key
```

In [None]:
import csv
import os
import time
from datetime import datetime, timezone
from pathlib import Path

import requests
from langsmith import Client

In [None]:
DATA_DIR = Path("data")
DATA_DIR.mkdir(parents=True, exist_ok=True)

# TODO: Configure your GitHub repository below
GITHUB_REPO = "langchain-ai/langchain"  # Format: "owner/repo"

CSV_PATH = DATA_DIR / f"{GITHUB_REPO.replace('/', '_')}_open_issues.csv"

print(f"Configuration: repo='{GITHUB_REPO}'")
print(f"CSV will be saved to: {CSV_PATH}")

## Fetch GitHub Issues

We'll fetch all open issues from the repository (excluding pull requests). This may take a few minutes for repositories with many issues.

In [None]:
GITHUB_TOKEN = os.getenv("GITHUB_TOKEN")

issues = []
session = requests.Session()
headers = {
    "Accept": "application/vnd.github+json",
    "User-Agent": "langsmith-insights-notebook"
}

if GITHUB_TOKEN:
    headers["Authorization"] = f"Bearer {GITHUB_TOKEN}"
    print("Using GitHub token for authentication (5,000 requests/hour)")
else:
    print("No GitHub token provided. Using unauthenticated requests (60 requests/hour)")
    print("Set GITHUB_TOKEN environment variable for higher rate limits")

page = 1
while True:
    params = {"state": "open", "per_page": 100, "page": page}
    response = session.get(
        f"https://api.github.com/repos/{GITHUB_REPO}/issues",
        headers=headers,
        params=params,
        timeout=30,  # Avoid hanging indefinitely on a stalled connection
    )
    
    # Handle rate limiting (GitHub may answer with 403 or 429 when the limit is exhausted)
    if response.status_code in (403, 429) and response.headers.get("X-RateLimit-Remaining") == "0":
        reset_ts = int(response.headers.get("X-RateLimit-Reset", "0"))
        sleep_for = max(reset_ts - int(time.time()), 0) + 1
        print(f"Rate limit hit. Sleeping for {sleep_for} seconds...")
        time.sleep(sleep_for)
        continue
    
    response.raise_for_status()
    page_items = response.json()
    
    if not page_items:
        break
    
    # Filter out pull requests (they also appear in the issues endpoint)
    non_pr_items = [issue for issue in page_items if "pull_request" not in issue]
    issues.extend(non_pr_items)
    print(f"Fetched page {page}: {len(non_pr_items)} issues (total: {len(issues)})")
    
    if len(page_items) < 100:
        break
    
    page += 1

print(f"\nCollected {len(issues)} open issues from {GITHUB_REPO}")
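
The loop above waits only when the primary rate limit is exhausted (`X-RateLimit-Remaining` is `0`). GitHub's rate-limit guidance also describes a `Retry-After` header for secondary limits and suggests waiting at least a minute when neither header is present. The following is a minimal sketch of a helper that covers those cases. The name `seconds_until_retry` and the decision to treat a header-less 403 as a permission error are assumptions for illustration, not part of the notebook.

```python
import time
from typing import Mapping, Optional


def seconds_until_retry(
    status_code: int,
    headers: Mapping[str, str],
    now: Optional[float] = None,
) -> Optional[int]:
    """Return how many seconds to wait before retrying, or None if not rate limited."""
    if status_code not in (403, 429):
        return None
    now = time.time() if now is None else now

    # Secondary rate limit: Retry-After carries the wait time in seconds.
    retry_after = headers.get("Retry-After")
    if retry_after is not None and retry_after.isdigit():
        return int(retry_after)

    # Primary rate limit: wait until the reset timestamp (epoch seconds).
    if headers.get("X-RateLimit-Remaining") == "0":
        reset_ts = int(headers.get("X-RateLimit-Reset", "0"))
        return max(reset_ts - int(now), 0) + 1

    # Assumption: a 403 without rate-limit headers is a real permission error.
    if status_code == 403:
        return None

    # 429 without further hints: fall back to a one-minute wait.
    return 60
```

Inside the fetch loop this could be called as `seconds_until_retry(response.status_code, response.headers)`. `response.headers` from `requests` is case-insensitive, whereas a plain `dict` needs the exact header spelling used here.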

## Preview Issue Data

Let's look at an example issue to see what data we have:

In [None]:
if issues:
    example = issues[0]
    print(f"Issue #{example['number']}: {example['title']}")
    # The API returns user=None for issues whose author account was deleted
    print(f"Author: {(example['user'] or {}).get('login', 'unknown')}")
    print(f"Labels: {', '.join(label['name'] for label in example['labels']) or 'None'}")
    print(f"Comments: {example['comments']}")
    print(f"Created: {example['created_at']}")
    print(f"URL: {example['html_url']}")
    print(f"\nBody (first 300 chars):\n{(example['body'] or 'No description')[:300]}...")
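
A single example says little about how the issues are distributed. As a quick complement to the preview, the sketch below counts how often each label appears. `count_labels` is a hypothetical helper name, and the sample data is made up to mirror the `labels` field used in the cells above.

```python
from collections import Counter


def count_labels(issues):
    """Count label occurrences across issues; unlabeled issues count as '(no labels)'."""
    counts = Counter()
    for issue in issues:
        names = [label["name"] for label in issue.get("labels", [])]
        counts.update(names or ["(no labels)"])
    return counts


# Made-up sample in the shape used by the notebook's issue dictionaries.
sample_issues = [
    {"number": 1, "labels": [{"name": "bug"}]},
    {"number": 2, "labels": [{"name": "bug"}, {"name": "docs"}]},
    {"number": 3, "labels": []},
]
label_counts = count_labels(sample_issues)
for name, count in label_counts.most_common():
    print(f"{name}: {count}")
```

In the notebook, `count_labels(issues).most_common(10)` would list the ten most frequent labels of the fetched repository.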

## Save Issues to CSV

Save the issues locally for reference and future analysis.

In [None]:
if issues:
    fieldnames = [
        "id", "number", "title", "body", "user", "labels",
        "state", "comments", "created_at", "updated_at", "html_url"
    ]
    
    with CSV_PATH.open("w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
        writer.writeheader()
        for issue in issues:
            labels = ", ".join(label["name"] for label in issue.get("labels", []))
            writer.writerow({
                "id": issue.get("id"),
                "number": issue.get("number"),
                "title": issue.get("title"),
                "body": issue.get("body"),
                # "user" can be present but None (deleted accounts), so guard with `or {}`
                "user": (issue.get("user") or {}).get("login"),
                "labels": labels,
                "state": issue.get("state"),
                "comments": issue.get("comments"),
                "created_at": issue.get("created_at"),
                "updated_at": issue.get("updated_at"),
                "html_url": issue.get("html_url"),
            })
    print(f"Saved {len(issues)} issues to {CSV_PATH}")
else:
    print("No issues to save")
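
To reuse the saved file in a later session without calling the GitHub API again, the CSV can be read back with `csv.DictReader`. The sketch below is one way to do that. `load_issues_csv` is a hypothetical helper, and the demonstration writes a small temporary file with two of the columns rather than touching `CSV_PATH`.

```python
import csv
import tempfile
from pathlib import Path


def load_issues_csv(path):
    """Read issues back as a list of dicts, one per row, with string values."""
    with Path(path).open(newline="", encoding="utf-8") as csv_file:
        return list(csv.DictReader(csv_file))


# Round-trip demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "demo_issues.csv"
    with demo_path.open("w", newline="", encoding="utf-8") as csv_file:
        writer = csv.DictWriter(csv_file, fieldnames=["number", "title"])
        writer.writeheader()
        writer.writerow({"number": 42, "title": "Multi-line\nbody, with comma"})
    rows = load_issues_csv(demo_path)

print(rows)
```

Every value comes back as a string, so numeric columns such as `number` and `comments` need an explicit `int(...)`. Opening the file with `newline=""` on both write and read keeps line breaks inside issue bodies intact.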

## Analyze with LangSmith Insights

Now we'll send the issues to LangSmith Insights for automatic clustering and analysis. Insights will:
- Identify common themes across issues (bugs, feature requests, documentation, infrastructure, etc.)
- Group similar issues together
- Generate summaries for each cluster
- Provide a visual interface to explore the results

This is particularly useful for:
- Understanding what users are requesting most
- Identifying patterns in bug reports
- Prioritizing roadmap items
- Getting a high-level view of repository health

In [None]:
# Format issues for the Insights API
# Each issue is converted to a chat history format with structured information
chat_histories = []

for issue in issues:
    labels = ", ".join(label["name"] for label in issue.get("labels", [])) or "(no labels)"
    body = issue.get("body") or "(no body provided)"
    
    content = (
        f"GitHub Issue #{issue.get('number')}\n"
        f"Title: {issue.get('title')}\n"
        f"Labels: {labels}\n"
        f"URL: {issue.get('html_url')}\n"
        f"Comments: {issue.get('comments')} | Created: {issue.get('created_at')}\n\n"
        f"Body:\n{body}"
    )
    
    chat_histories.append([{"role": "user", "content": content}])

print(f"Prepared {len(chat_histories)} issues for analysis")
if chat_histories:
    print(f"\nExample formatted issue:\n{chat_histories[0][0]['content'][:500]}...")
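
Each entry in `chat_histories` is a list of message dictionaries with `role` and `content` keys, here a single `user` message per issue. A small check before submission can catch malformed entries early. The helper name `validate_chat_histories` and its rules are assumptions for illustration: they mirror the shape built above, not any documented LangSmith validation.

```python
def validate_chat_histories(chat_histories):
    """Return indices of entries that are not non-empty lists of role/content dicts."""
    bad = []
    for index, history in enumerate(chat_histories):
        ok = (
            isinstance(history, list)
            and len(history) > 0
            and all(
                isinstance(msg, dict)
                and isinstance(msg.get("role"), str)
                and isinstance(msg.get("content"), str)
                and msg["content"].strip() != ""
                for msg in history
            )
        )
        if not ok:
            bad.append(index)
    return bad


# Made-up sample: one valid entry, one empty history, one empty message.
sample = [
    [{"role": "user", "content": "GitHub Issue #1\nTitle: Example"}],
    [],
    [{"role": "user", "content": ""}],
]
bad_indices = validate_chat_histories(sample)
print(bad_indices)
```

An empty result from `validate_chat_histories(chat_histories)` means every issue was formatted into the expected one-message shape.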

In [None]:
LANGSMITH_API_KEY = os.getenv("LANGSMITH_API_KEY")
if not LANGSMITH_API_KEY:
    raise RuntimeError(
        "LANGSMITH_API_KEY environment variable is required. "
        "Get an API key at https://smith.langchain.com/settings"
    )

client = Client(api_key=LANGSMITH_API_KEY)

report = client.generate_insights(
    chat_histories=chat_histories,
    instructions=(
        f"These are open issues from the {GITHUB_REPO} GitHub repository. "
        "Cluster them by theme (bugs, feature requests, documentation, infrastructure, etc.). "
        "Help identify patterns in what users are requesting and reporting."
    ),
    name=f"{GITHUB_REPO.replace('/', '-')}-issues-{datetime.now(timezone.utc):%Y%m%d-%H%M}",
)

print("\nInsights report generated successfully!")
print(f"View your results at: {getattr(report, 'url', 'Check LangSmith UI')}")