# OpenAlex Data Import Notebook

This notebook demonstrates how to import and work with data from the OpenAlex API. OpenAlex is a comprehensive, open database of academic works, authors, institutions, concepts, and more.

## OpenAlex API Overview
- **Base URL**: https://api.openalex.org/
- **Open Access**: No authentication required for basic usage
- **Rate Limiting**: Polite usage recommended (1 request per second)
- **Coverage**: 240M+ works, 90M+ authors, 100K+ institutions
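
As a rough illustration of how the base URL and query parameters fit together, the sketch below composes a request URL without sending anything. `build_url` is a hypothetical helper used only in this example; `filter` and `per-page` are the same parameters used later in this notebook.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

BASE_URL = "https://api.openalex.org"

def build_url(endpoint: str, params: Optional[Dict[str, object]] = None) -> str:
    """Hypothetical helper: compose an OpenAlex URL without sending a request."""
    url = f"{BASE_URL}/{endpoint.strip('/')}"
    if params:
        # Keep ':' and ',' unescaped, since filter expressions use them as separators
        url += "?" + urlencode(params, safe=":,")
    return url

example_url = build_url("works", {"filter": "publication_year:2023", "per-page": 1})
print(example_url)
```

Pasting the printed URL into a browser is a quick way to inspect the raw JSON before writing any parsing code.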

## What we'll cover:
1. Setting up API connections
2. Fetching different types of data (works, authors, institutions)
3. Handling pagination for large datasets
4. Processing and cleaning data
5. Exporting data for analysis

## 1. Import Required Libraries

In [None]:
# Essential libraries for API interaction and data handling
import requests
import pandas as pd
import json
import time
from typing import Dict, List, Optional, Any
from urllib.parse import urljoin, urlencode
import warnings
warnings.filterwarnings('ignore')

# For data visualization (optional)
try:
    import matplotlib.pyplot as plt
    import seaborn as sns
    print("Visualization libraries loaded successfully")
except ImportError:
    print("Visualization libraries not available - install matplotlib and seaborn for plots")

print("All required libraries imported successfully!")

## 2. Set Up API Configuration

In [None]:
# OpenAlex API Configuration
class OpenAlexConfig:
    """Configuration class for OpenAlex API"""
    
    BASE_URL = "https://api.openalex.org"
    
    # Common headers - add your email so OpenAlex can identify you (polite pool)
    HEADERS = {
        "User-Agent": "KnowledgeFabric/1.0 (mailto:your-email@domain.com)",
        "Accept": "application/json"
    }
    
    # Rate limiting - be polite to the API
    REQUEST_DELAY = 1.1  # seconds between requests (slightly above 1/sec)
    
    # Common parameters
    DEFAULT_PARAMS = {
        "per-page": 25,  # results per page (max 200)
        "mailto": "your-email@domain.com"  # for polite pool access
    }

# Initialize configuration
config = OpenAlexConfig()
print(f"OpenAlex API Base URL: {config.BASE_URL}")
print(f"Default results per page: {config.DEFAULT_PARAMS['per-page']}")
print(f"Request delay: {config.REQUEST_DELAY} seconds")

## 3. Basic API Request Functions

In [None]:
def make_openalex_request(endpoint: str, params: Optional[Dict] = None, delay: bool = True) -> Optional[Dict]:
    """
    Make a request to the OpenAlex API with error handling and rate limiting.
    
    Args:
        endpoint: API endpoint (e.g., 'works', 'authors', 'institutions')
        params: Query parameters
        delay: Whether to add delay for rate limiting
    
    Returns:
        JSON response as dictionary, or None if the request fails
    """
    if params is None:
        params = {}
    
    # Merge with default parameters
    final_params = {**config.DEFAULT_PARAMS, **params}
    
    # Build URL
    url = urljoin(config.BASE_URL, endpoint)
    
    try:
        # Rate limiting
        if delay:
            time.sleep(config.REQUEST_DELAY)
        
        # Make request (the timeout keeps a stalled connection from hanging the notebook)
        response = requests.get(url, headers=config.HEADERS, params=final_params, timeout=30)
        response.raise_for_status()
        
        return response.json()
        
    except requests.exceptions.RequestException as e:
        print(f"Error making request to {url}: {e}")
        return None

def test_api_connection():
    """Test the API connection with a simple request"""
    print("Testing OpenAlex API connection...")
    
    response = make_openalex_request("works", {"filter": "publication_year:2023", "per-page": 1})
    
    if response and "results" in response:
        print("✅ API connection successful!")
        print(f"Total works in 2023: {response['meta']['count']:,}")
        return True
    else:
        print("❌ API connection failed!")
        return False

# Test the connection
test_api_connection()

## 4. Fetch Works Data

In [None]:
def fetch_works(filters: Dict = None, limit: int = 100) -> pd.DataFrame:
    """
    Fetch academic works from OpenAlex API.
    
    Args:
        filters: Dictionary of filters to apply
        limit: Maximum number of results to fetch
    
    Returns:
        DataFrame with works data
    """
    works_data = []
    per_page = min(25, limit)  # API max is 200, but start with 25
    pages_needed = (limit + per_page - 1) // per_page
    
    print(f"Fetching {limit} works across {pages_needed} pages...")
    
    for page in range(1, pages_needed + 1):
        params = {
            "per-page": per_page,
            "page": page
        }
        
        # Add filters if provided
        if filters:
            filter_string = ",".join([f"{k}:{v}" for k, v in filters.items()])
            params["filter"] = filter_string
        
        response = make_openalex_request("works", params)
        
        if not response or "results" not in response:
            print(f"Failed to fetch page {page}")
            break
        
        # Extract relevant fields from each work
        for work in response["results"]:
            work_data = {
                "id": work.get("id", "").replace("https://openalex.org/", ""),
                "doi": work.get("doi"),
                "title": work.get("title"),
                "publication_year": work.get("publication_year"),
                "publication_date": work.get("publication_date"),
                "type": work.get("type"),
                "cited_by_count": work.get("cited_by_count", 0),
                "is_open_access": (work.get("open_access") or {}).get("is_oa", False),
                "language": work.get("language"),
                # primary_location and its source can both be null, so fall back to empty dicts
                "primary_location": ((work.get("primary_location") or {}).get("source") or {}).get("display_name"),
                "corresponding_author_ids": [auth.get("author", {}).get("id", "").replace("https://openalex.org/", "") 
                                           for auth in work.get("authorships", []) if auth.get("is_corresponding")],
                "institution_ids": list(set([inst.get("id", "").replace("https://openalex.org/", "") 
                                           for auth in work.get("authorships", []) 
                                           for inst in auth.get("institutions", [])])),
                "concept_ids": [concept.get("id", "").replace("https://openalex.org/", "") 
                              for concept in work.get("concepts", [])],
                # Abstracts arrive only as an inverted index, so "abstract" is normally None
                "abstract": work.get("abstract"),
                # True when an inverted-index abstract is present (the field may be null)
                "abstract_inverted_index": bool(work.get("abstract_inverted_index"))
            }
            works_data.append(work_data)
        
        print(f"Fetched page {page}/{pages_needed} - {len(works_data)} total works")
        
        # Stop if we've reached the limit
        if len(works_data) >= limit:
            works_data = works_data[:limit]
            break
    
    return pd.DataFrame(works_data)

# Example: Fetch recent AI/ML papers
print("Example 1: Fetching recent AI/ML papers...")
ai_works = fetch_works(
    filters={
        "concepts.id": "C154945302",  # Artificial Intelligence concept ID
        "publication_year": "2023",
        "type": "article"
    },
    limit=50
)

print(f"\nFetched {len(ai_works)} AI papers:")
print(ai_works[["title", "publication_year", "cited_by_count", "is_open_access"]].head())
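
The `filters` dictionary in the cell above is joined into a comma-separated `key:value` string before it is sent. A minimal sketch of that step as a standalone helper follows; `build_filter_string` is a hypothetical name, and lowercasing booleans is an assumption made so that a Python `True` matches the `"true"` string used elsewhere in this notebook.

```python
from typing import Dict, Optional

def build_filter_string(filters: Optional[Dict[str, object]]) -> Optional[str]:
    """Hypothetical helper: turn {"key": value} pairs into "key:value,key:value"."""
    if not filters:
        return None
    parts = []
    for key, value in filters.items():
        if isinstance(value, bool):
            # Assumption: booleans are sent in lowercase ("true"/"false")
            value = str(value).lower()
        parts.append(f"{key}:{value}")
    return ",".join(parts)

filter_string = build_filter_string({"publication_year": 2023, "type": "article", "is_oa": True})
print(filter_string)
```

Keeping this logic in one place avoids repeating the same join in each fetch function.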

## 5. Fetch Authors Data

In [None]:
def fetch_authors(filters: Dict = None, limit: int = 50) -> pd.DataFrame:
    """
    Fetch author information from OpenAlex API.
    
    Args:
        filters: Dictionary of filters to apply
        limit: Maximum number of results to fetch
    
    Returns:
        DataFrame with author data
    """
    authors_data = []
    per_page = min(25, limit)
    pages_needed = (limit + per_page - 1) // per_page
    
    print(f"Fetching {limit} authors across {pages_needed} pages...")
    
    for page in range(1, pages_needed + 1):
        params = {
            "per-page": per_page,
            "page": page
        }
        
        # Add filters if provided
        if filters:
            filter_string = ",".join([f"{k}:{v}" for k, v in filters.items()])
            params["filter"] = filter_string
        
        response = make_openalex_request("authors", params)
        
        if not response or "results" not in response:
            print(f"Failed to fetch page {page}")
            break
        
        # Extract relevant fields from each author
        for author in response["results"]:
            # Get current affiliation
            current_affiliation = None
            if author.get("affiliations"):
                current_affiliation = author["affiliations"][0].get("institution", {}).get("display_name")
            
            author_data = {
                "id": author.get("id", "").replace("https://openalex.org/", ""),
                "orcid": author.get("orcid"),
                "display_name": author.get("display_name"),
                "display_name_alternatives": author.get("display_name_alternatives", []),
                "works_count": author.get("works_count", 0),
                "cited_by_count": author.get("cited_by_count", 0),
                "h_index": author.get("summary_stats", {}).get("h_index", 0),
                "i10_index": author.get("summary_stats", {}).get("i10_index", 0),
                "current_affiliation": current_affiliation,
                "affiliation_history": [aff.get("institution", {}).get("display_name") 
                                      for aff in author.get("affiliations", [])],
                "concept_ids": [concept.get("id", "").replace("https://openalex.org/", "") 
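                              for concept in author.get("x_concepts", [])],
                # The field may be null, and newer responses may expose a `last_known_institutions` list instead
                "last_known_institution": (author.get("last_known_institution") or next(iter(author.get("last_known_institutions") or []), {})).get("display_name"),
                # Author records carry no first-publication-year field, so this stays empty
                "first_publication_year": None,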
                "two_year_mean_citedness": author.get("summary_stats", {}).get("2yr_mean_citedness")
            }
            authors_data.append(author_data)
        
        print(f"Fetched page {page}/{pages_needed} - {len(authors_data)} total authors")
        
        # Stop if we've reached the limit
        if len(authors_data) >= limit:
            authors_data = authors_data[:limit]
            break
    
    return pd.DataFrame(authors_data)

# Example: Fetch highly cited authors in computer science
print("Example 2: Fetching highly cited computer science authors...")
cs_authors = fetch_authors(
    filters={
        "x_concepts.id": "C41008148",  # Computer Science concept ID
        "cited_by_count": ">1000"
    },
    limit=25
)

print(f"\nFetched {len(cs_authors)} highly cited CS authors:")
print(cs_authors[["display_name", "works_count", "cited_by_count", "h_index", "current_affiliation"]].head())

## 6. Fetch Institutions Data

In [None]:
def fetch_institutions(filters: Dict = None, limit: int = 50) -> pd.DataFrame:
    """
    Fetch institution information from OpenAlex API.
    
    Args:
        filters: Dictionary of filters to apply
        limit: Maximum number of results to fetch
    
    Returns:
        DataFrame with institution data
    """
    institutions_data = []
    per_page = min(25, limit)
    pages_needed = (limit + per_page - 1) // per_page
    
    print(f"Fetching {limit} institutions across {pages_needed} pages...")
    
    for page in range(1, pages_needed + 1):
        params = {
            "per-page": per_page,
            "page": page
        }
        
        # Add filters if provided
        if filters:
            filter_string = ",".join([f"{k}:{v}" for k, v in filters.items()])
            params["filter"] = filter_string
        
        response = make_openalex_request("institutions", params)
        
        if not response or "results" not in response:
            print(f"Failed to fetch page {page}")
            break
        
        # Extract relevant fields from each institution
        for institution in response["results"]:
            geo = institution.get("geo") or {}  # geo can be null
            
            institution_data = {
                "id": institution.get("id", "").replace("https://openalex.org/", ""),
                "ror": institution.get("ror"),
                "display_name": institution.get("display_name"),
                "country_code": institution.get("country_code"),
                "type": institution.get("type"),
                "homepage_url": institution.get("homepage_url"),
                "image_url": institution.get("image_url"),
                "works_count": institution.get("works_count", 0),
                "cited_by_count": institution.get("cited_by_count", 0),
                "h_index": institution.get("summary_stats", {}).get("h_index", 0),
                "i10_index": institution.get("summary_stats", {}).get("i10_index", 0),
                "two_year_mean_citedness": institution.get("summary_stats", {}).get("2yr_mean_citedness"),
                "city": geo.get("city"),
                "region": geo.get("region"),
                "country": geo.get("country"),
                "latitude": geo.get("latitude"),
                "longitude": geo.get("longitude"),
                "associated_institutions": [assoc.get("id", "").replace("https://openalex.org/", "") 
                                          for assoc in institution.get("associated_institutions", [])],
                "concept_ids": [concept.get("id", "").replace("https://openalex.org/", "") 
                              for concept in institution.get("x_concepts", [])]
            }
            institutions_data.append(institution_data)
        
        print(f"Fetched page {page}/{pages_needed} - {len(institutions_data)} total institutions")
        
        # Stop if we've reached the limit
        if len(institutions_data) >= limit:
            institutions_data = institutions_data[:limit]
            break
    
    return pd.DataFrame(institutions_data)

# Example: Fetch top research universities
print("Example 3: Fetching top research universities...")
top_universities = fetch_institutions(
    filters={
        "type": "education",
        "works_count": ">10000"
    },
    limit=20
)

print(f"\nFetched {len(top_universities)} top universities:")
print(top_universities[["display_name", "country", "works_count", "cited_by_count", "h_index"]].head())
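
All three fetch functions strip the `https://openalex.org/` prefix with `.get("id", "").replace(...)`, which raises an error if an `id` is ever `None`. A small sketch of a null-safe alternative is shown below; `short_id` is a hypothetical helper, not part of any OpenAlex client.

```python
from typing import Optional

OPENALEX_PREFIX = "https://openalex.org/"

def short_id(openalex_url: Optional[str]) -> str:
    """Hypothetical helper: 'https://openalex.org/W123' -> 'W123'; None or '' -> ''."""
    if not openalex_url:
        return ""
    if openalex_url.startswith(OPENALEX_PREFIX):
        return openalex_url[len(OPENALEX_PREFIX):]
    # Already short, or not an OpenAlex URL: return unchanged
    return openalex_url

print(short_id("https://openalex.org/I12345"))
print(repr(short_id(None)))
```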

## 7. Handle Pagination

In [None]:
def fetch_all_results_with_cursor(endpoint: str, filters: Dict = None, max_results: int = None) -> List[Dict]:
    """
    Fetch all results using cursor-based pagination for large datasets.
    
    Args:
        endpoint: API endpoint ('works', 'authors', 'institutions')
        filters: Dictionary of filters to apply
        max_results: Maximum number of results to fetch (None for all)
    
    Returns:
        List of all results
    """
    all_results = []
    cursor = "*"  # Start with initial cursor
    per_page = 200  # Maximum allowed per page
    
    print(f"Fetching all results from {endpoint} endpoint...")
    
    while cursor and (max_results is None or len(all_results) < max_results):
        params = {
            "per-page": per_page,
            "cursor": cursor
        }
        
        # Add filters if provided
        if filters:
            filter_string = ",".join([f"{k}:{v}" for k, v in filters.items()])
            params["filter"] = filter_string
        
        response = make_openalex_request(endpoint, params)
        
        if not response or "results" not in response:
            print(f"Failed to fetch results")
            break
        
        # Add results to our collection
        batch_results = response["results"]
        if max_results:
            remaining = max_results - len(all_results)
            batch_results = batch_results[:remaining]
        
        all_results.extend(batch_results)
        
        # Get next cursor
        meta = response.get("meta", {})
        cursor = meta.get("next_cursor")
        
        print(f"Fetched {len(batch_results)} results - Total: {len(all_results)}")
        
        # Stop if no more pages or we've reached max
        if not cursor or (max_results and len(all_results) >= max_results):
            break
    
    print(f"Completed fetching {len(all_results)} total results")
    return all_results

# Example: Fetch a large dataset of recent open access papers
print("Example 4: Fetching large dataset with cursor pagination...")
print("Note: This will take a while and fetch many results. Adjust max_results as needed.")

# Fetch 1000 recent open access papers (you can increase this number)
oa_papers = fetch_all_results_with_cursor(
    endpoint="works",
    filters={
        "is_oa": "true",
        "publication_year": "2023",
        "type": "article"
    },
    max_results=1000  # Adjust this number based on your needs
)

print(f"Fetched {len(oa_papers)} open access papers from 2023")
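
To see the cursor loop in isolation, the sketch below runs the same control flow against a few in-memory pages instead of the live API. `FAKE_PAGES`, `fake_fetch`, and `collect_with_cursor` are hypothetical names; the fake responses only mimic the `results` and `meta.next_cursor` keys that the function above reads.

```python
from typing import Callable, Dict, List, Optional

# Fake pages keyed by cursor value; the last page has no next cursor
FAKE_PAGES = {
    "*": {"results": [{"id": "W1"}, {"id": "W2"}], "meta": {"next_cursor": "abc"}},
    "abc": {"results": [{"id": "W3"}, {"id": "W4"}], "meta": {"next_cursor": "def"}},
    "def": {"results": [{"id": "W5"}], "meta": {"next_cursor": None}},
}

def fake_fetch(cursor: str) -> Dict:
    """Stand-in for an API call: return the page stored under this cursor."""
    return FAKE_PAGES[cursor]

def collect_with_cursor(fetch_page: Callable[[str], Dict],
                        max_results: Optional[int] = None) -> List[Dict]:
    """Follow next_cursor until it is empty or max_results is reached."""
    collected: List[Dict] = []
    cursor: Optional[str] = "*"
    while cursor and (max_results is None or len(collected) < max_results):
        page = fetch_page(cursor)
        batch = page["results"]
        if max_results is not None:
            # Trim the final batch so the total never exceeds max_results
            batch = batch[: max_results - len(collected)]
        collected.extend(batch)
        cursor = page["meta"].get("next_cursor")
    return collected

all_ids = [w["id"] for w in collect_with_cursor(fake_fetch)]
first_three = [w["id"] for w in collect_with_cursor(fake_fetch, max_results=3)]
print(all_ids)
print(first_three)
```

Testing the loop this way makes it easy to check the stopping conditions before spending real requests.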

## 8. Data Processing and Cleaning

In [None]:
def process_works_data(raw_works: List[Dict]) -> pd.DataFrame:
    """
    Process raw works data into a clean DataFrame with expanded nested fields.
    
    Args:
        raw_works: List of raw work dictionaries from API
        
    Returns:
        Cleaned DataFrame with processed fields
    """
    processed_data = []
    
    print(f"Processing {len(raw_works)} works...")
    
    for work in raw_works:
        # Basic fields
        processed_work = {
            "openalex_id": work.get("id", "").replace("https://openalex.org/", ""),
            "doi": work.get("doi"),
            "title": work.get("title"),
            "publication_year": work.get("publication_year"),
            "publication_date": work.get("publication_date"),
            "type": work.get("type"),
            "cited_by_count": work.get("cited_by_count", 0),
            "language": work.get("language"),
        }
        
        # Open access information
        oa_info = work.get("open_access", {})
        processed_work.update({
            "is_oa": oa_info.get("is_oa", False),
            "oa_url": oa_info.get("oa_url"),
            "any_repository_has_fulltext": oa_info.get("any_repository_has_fulltext", False)
        })
        
        # Primary location (journal/venue)
        primary_location = work.get("primary_location") or {}
        if primary_location:
            source = primary_location.get("source") or {}  # source can be null
            processed_work.update({
                "journal_name": source.get("display_name"),
                "journal_issn_l": source.get("issn_l"),
                "journal_is_oa": source.get("is_oa", False),
                "journal_type": source.get("type")
            })
        
        # Authors information
        authorships = work.get("authorships", [])
        processed_work.update({
            "author_count": len(authorships),
            "author_names": [auth.get("author", {}).get("display_name") for auth in authorships],
            "corresponding_author_count": sum(1 for auth in authorships if auth.get("is_corresponding")),
            "institution_count": len(set([inst.get("id") for auth in authorships for inst in auth.get("institutions", []) if inst.get("id")]))
        })
        
        # Concepts (subject areas)
        concepts = work.get("concepts", [])
        if concepts:
            # Get top 3 concepts by score
            top_concepts = sorted(concepts, key=lambda x: x.get("score", 0), reverse=True)[:3]
            processed_work.update({
                "primary_concept": top_concepts[0].get("display_name") if top_concepts else None,
                "concept_score": top_concepts[0].get("score") if top_concepts else None,
                "all_concepts": [c.get("display_name") for c in concepts]
            })
        
        # Abstract information
        processed_work.update({
            "has_abstract": bool(work.get("abstract")),
            "abstract_length": len(work.get("abstract") or ""),
            "has_inverted_abstract": bool(work.get("abstract_inverted_index"))
        })
        
        processed_data.append(processed_work)
    
    df = pd.DataFrame(processed_data)
    
    # Convert date columns
    if "publication_date" in df.columns:
        df["publication_date"] = pd.to_datetime(df["publication_date"], errors="coerce")
    
    print(f"Processed {len(df)} works successfully")
    print(f"Columns: {list(df.columns)}")
    
    return df

def analyze_data_quality(df: pd.DataFrame, entity_type: str = "works") -> None:
    """
    Analyze data quality and completeness of the fetched data.
    
    Args:
        df: DataFrame to analyze
        entity_type: Type of entities ('works', 'authors', 'institutions')
    """
    print(f"\n=== Data Quality Analysis for {entity_type.title()} ===")
    print(f"Total records: {len(df)}")
    print(f"Total columns: {len(df.columns)}")
    
    # Missing values analysis
    print("\nMissing values per column:")
    missing = df.isnull().sum()
    missing_pct = (missing / len(df)) * 100
    
    for col in df.columns:
        if missing[col] > 0:
            print(f"  {col}: {missing[col]} ({missing_pct[col]:.1f}%)")
    
    # Basic statistics for numeric columns
    numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
    if len(numeric_cols) > 0:
        print(f"\nNumeric column statistics:")
        print(df[numeric_cols].describe())
    
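    # Duplicates (list/dict columns are unhashable, so compare on the remaining columns)
    hashable_cols = [
        col for col in df.columns
        if not df[col].map(lambda v: isinstance(v, (list, dict, set))).any()
    ]
    duplicates = df.duplicated(subset=hashable_cols).sum() if hashable_cols else 0
    print(f"\nDuplicate records: {duplicates}")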
    
    # Data types
    print(f"\nData types:")
    print(df.dtypes.value_counts())

# Example: Process the AI works we fetched earlier
if 'ai_works' in locals() and len(ai_works) > 0:
    print("Processing AI works data...")
    # Convert to the raw format expected by the processing function
    ai_works_raw = []
    for _, row in ai_works.iterrows():
        # Create a mock raw work structure for processing
        raw_work = {
            "id": f"https://openalex.org/{row['id']}",
            "title": row["title"],
            "publication_year": row["publication_year"],
            "cited_by_count": row["cited_by_count"],
            "open_access": {"is_oa": row["is_open_access"]},
            "type": row["type"]
        }
        ai_works_raw.append(raw_work)
    
    processed_ai_works = process_works_data(ai_works_raw)
    analyze_data_quality(processed_ai_works, "AI works")
else:
    print("No AI works data available for processing. Run the fetch_works cell first.")
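
Works carry their abstract as an inverted index (each word mapped to the positions where it occurs), which is why the processing function checks `abstract_inverted_index`. A minimal sketch of rebuilding plain text from that structure follows; `reconstruct_abstract` is a hypothetical helper and the sample index is made up for illustration.

```python
from typing import Dict, List, Optional

def reconstruct_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> Optional[str]:
    """Hypothetical helper: rebuild text from a {word: [positions]} mapping."""
    if not inverted_index:
        return None
    # Flatten to (position, word) pairs, then order by position
    positioned = [
        (position, word)
        for word, positions in inverted_index.items()
        for position in positions
    ]
    positioned.sort()
    return " ".join(word for _, word in positioned)

# Made-up index for illustration, not real API output
sample_index = {"Open": [0], "data": [1, 4], "helps": [2], "share": [3]}
abstract_text = reconstruct_abstract(sample_index)
print(abstract_text)
```

Applied to each raw work, this could populate the `abstract` field that is otherwise empty.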

## 9. Export Data to Different Formats

In [None]:
import os
from datetime import datetime

def export_data(df: pd.DataFrame, entity_type: str, base_filename: str = None) -> Dict[str, str]:
    """
    Export DataFrame to multiple formats (CSV, JSON, Excel).
    
    Args:
        df: DataFrame to export
        entity_type: Type of entities ('works', 'authors', 'institutions')
        base_filename: Base filename (will add timestamp if None)
    
    Returns:
        Dictionary mapping format to filepath
    """
    if base_filename is None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        base_filename = f"openalex_{entity_type}_{timestamp}"
    
    # Create data directory if it doesn't exist (adjust this path for your environment)
    data_dir = "/home/dnhoa/IAAIR/data"
    os.makedirs(data_dir, exist_ok=True)
    
    exported_files = {}
    
    # Export to CSV
    csv_path = os.path.join(data_dir, f"{base_filename}.csv")
    df.to_csv(csv_path, index=False)
    exported_files["csv"] = csv_path
    print(f"✅ Exported to CSV: {csv_path}")
    
    # Export to JSON
    json_path = os.path.join(data_dir, f"{base_filename}.json")
    df.to_json(json_path, orient="records", indent=2)
    exported_files["json"] = json_path
    print(f"✅ Exported to JSON: {json_path}")
    
    # Export to Excel (if openpyxl is available)
    try:
        excel_path = os.path.join(data_dir, f"{base_filename}.xlsx")
        df.to_excel(excel_path, index=False, engine='openpyxl')
        exported_files["excel"] = excel_path
        print(f"✅ Exported to Excel: {excel_path}")
    except ImportError:
        print("⚠️  Excel export skipped (install openpyxl for Excel support)")
    
    return exported_files

def create_data_summary(datasets: Dict[str, pd.DataFrame]) -> pd.DataFrame:
    """
    Create a summary of all collected datasets.
    
    Args:
        datasets: Dictionary mapping dataset name to DataFrame
        
    Returns:
        Summary DataFrame
    """
    summary_data = []
    
    for name, df in datasets.items():
        summary = {
            "dataset": name,
            "records": len(df),
            "columns": len(df.columns),
            "memory_usage_mb": round(df.memory_usage(deep=True).sum() / 1024 / 1024, 2),
            "date_range": "",
            "key_columns": list(df.columns[:5])  # First 5 columns
        }
        
        # Try to get date range if publication_year exists
        if "publication_year" in df.columns:
            years = df["publication_year"].dropna()
            if len(years) > 0:
                summary["date_range"] = f"{years.min()}-{years.max()}"
        
        summary_data.append(summary)
    
    return pd.DataFrame(summary_data)

# Example: Export all the data we've collected
datasets_to_export = {}

# Add datasets if they exist
if 'ai_works' in locals():
    datasets_to_export['ai_works'] = ai_works
if 'cs_authors' in locals():
    datasets_to_export['cs_authors'] = cs_authors
if 'top_universities' in locals():
    datasets_to_export['top_universities'] = top_universities

print("Exporting collected datasets...")
exported_files_summary = {}

for dataset_name, df in datasets_to_export.items():
    print(f"\nExporting {dataset_name} ({len(df)} records)...")
    files = export_data(df, dataset_name)
    exported_files_summary[dataset_name] = files

# Create and export summary
if datasets_to_export:
    print("\nCreating data summary...")
    summary_df = create_data_summary(datasets_to_export)
    summary_files = export_data(summary_df, "summary", "data_collection_summary")
    
    print("\n=== Data Collection Summary ===")
    print(summary_df.to_string(index=False))
    
    print(f"\n=== All Exported Files ===")
    for dataset, files in exported_files_summary.items():
        print(f"{dataset}:")
        for format_type, filepath in files.items():
            print(f"  {format_type.upper()}: {filepath}")
else:
    print("No datasets available for export. Run the data fetching cells first.")
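
Several columns in these DataFrames hold Python lists (for example `concept_ids`), and `to_csv` writes them as their string representation, which is awkward to parse later. One possible approach, sketched here with the standard library only, is to join list values with a separator before export. `flatten_record` and the `|` separator are assumptions for illustration.

```python
import csv
import io
from typing import Dict

def flatten_record(record: Dict[str, object], sep: str = "|") -> Dict[str, object]:
    """Hypothetical helper: join list/tuple values into one delimited string."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, (list, tuple)):
            flat[key] = sep.join(str(item) for item in value)
        else:
            flat[key] = value
    return flat

# Made-up rows for illustration
rows = [
    {"id": "W1", "concept_ids": ["C1", "C2"]},
    {"id": "W2", "concept_ids": []},
]
flat_rows = [flatten_record(row) for row in rows]

# Write to an in-memory buffer instead of disk so the example has no side effects
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "concept_ids"])
writer.writeheader()
writer.writerows(flat_rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The same idea carries over to a DataFrame by applying the join to each list column before calling `export_data`.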

## 10. Advanced Queries and Use Cases

Here are some advanced query examples for specific research tasks:

In [None]:
# Advanced Query Examples

# 1. COVID-19 research impact analysis
def analyze_covid_research():
    """Analyze COVID-19 research output and impact"""
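    covid_params = {
        # Concept ID as given; confirm it resolves to COVID-19 via the concepts endpoint before relying on it
        "filter": "concepts.id:C2778407487,publication_year:>2019",
        "sort": "cited_by_count:desc",  # most cited first, so the first page holds the top papers
        "per-page": 50
    }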
    
    response = make_openalex_request("works", covid_params)
    if response:
        print(f"COVID-19 papers since 2020: {response['meta']['count']:,}")
        
        # Analyze top papers by citations
        papers = response["results"]
        top_papers = sorted(papers, key=lambda x: x.get("cited_by_count", 0), reverse=True)[:5]
        
        print("\nTop 5 most cited COVID-19 papers:")
        for i, paper in enumerate(top_papers, 1):
            title = (paper.get("title") or "No title")[:60] + "..."  # title can be null
            citations = paper.get("cited_by_count", 0)
            year = paper.get("publication_year")
            print(f"{i}. {title} ({year}) - {citations:,} citations")

# 2. Collaboration network analysis
def analyze_international_collaboration():
    """Analyze international collaboration patterns"""
    collab_params = {
        # US-affiliated works whose author institutions span more than one country
        "filter": "authorships.institutions.country_code:US,countries_distinct_count:>1,publication_year:2023",
        "per-page": 25
    }
    
    response = make_openalex_request("works", collab_params)
    if response:
        print(f"US international collaborations in 2023: {response['meta']['count']:,}")

# 3. Open Science analysis
def analyze_open_science_trends():
    """Analyze open access and open science trends"""
    # Compare OA vs non-OA papers by citation impact
    oa_params = {
        "filter": "is_oa:true,publication_year:2022",
        "per-page": 100
    }
    
    non_oa_params = {
        "filter": "is_oa:false,publication_year:2022", 
        "per-page": 100
    }
    
    oa_response = make_openalex_request("works", oa_params)
    non_oa_response = make_openalex_request("works", non_oa_params)
    
    if oa_response and non_oa_response:
        oa_citations = [w.get("cited_by_count", 0) for w in oa_response["results"]]
        non_oa_citations = [w.get("cited_by_count", 0) for w in non_oa_response["results"]]
        
        # Averages cover only the first page of each query, so treat them as a rough sample
        if oa_citations and non_oa_citations:
            print("Open Access vs Non-Open Access Citation Analysis (2022):")
            print(f"OA papers average citations: {sum(oa_citations)/len(oa_citations):.2f}")
            print(f"Non-OA papers average citations: {sum(non_oa_citations)/len(non_oa_citations):.2f}")
        else:
            print("Not enough results to compare citation averages.")

# 4. Emerging research topics
def find_emerging_topics():
    """List the largest research concepts by total works (a starting point for spotting growth)"""
    # Ranks concepts by overall size; measuring growth would need per-year counts as well
    concepts_params = {
        "filter": "works_count:>1000",
        "sort": "works_count:desc",
        "per-page": 20
    }
    
    response = make_openalex_request("concepts", concepts_params)
    if response:
        print("Top 10 research concepts by total works:")
        for i, concept in enumerate(response["results"][:10], 1):
            name = concept.get("display_name", "Unknown")
            count = concept.get("works_count", 0)
            level = concept.get("level", 0)
            print(f"{i}. {name} (Level {level}) - {count:,} works")

# 5. Author productivity analysis
def analyze_author_productivity():
    """Analyze author productivity patterns"""
    productive_authors_params = {
        "filter": "works_count:>100,last_known_institution.country_code:US",
        "sort": "cited_by_count:desc",
        "per-page": 10
    }
    
    response = make_openalex_request("authors", productive_authors_params)
    if response:
        print("Top 10 most cited US-based prolific authors:")
        for i, author in enumerate(response["results"], 1):
            name = author.get("display_name", "Unknown")
            works = author.get("works_count", 0)
            citations = author.get("cited_by_count", 0)
            h_index = author.get("summary_stats", {}).get("h_index", 0)
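            # The field may be null, and newer responses may expose a `last_known_institutions` list instead
            last_inst = author.get("last_known_institution") or next(iter(author.get("last_known_institutions") or []), {})
            affiliation = last_inst.get("display_name", "Unknown")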
            
            print(f"{i}. {name} - {works} works, {citations:,} citations, h-index: {h_index}")
            print(f"   Affiliation: {affiliation}")

# Run example analyses
print("=== Advanced OpenAlex Analysis Examples ===\n")

print("1. COVID-19 Research Analysis:")
analyze_covid_research()

print("\n" + "="*50)
print("2. International Collaboration:")
analyze_international_collaboration()

print("\n" + "="*50)
print("3. Open Science Trends:")
analyze_open_science_trends()

print("\n" + "="*50)
print("4. Top Research Concepts:")
find_emerging_topics()

print("\n" + "="*50)
print("5. Author Productivity:")
analyze_author_productivity()
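
Citation counts are usually heavily skewed, so a plain average like the one in the open science example can be dominated by a few highly cited papers. The sketch below reports the median next to the mean and guards against empty lists; `summarize_citations` is a hypothetical helper and the numbers are illustrative, not real data.

```python
from statistics import mean, median
from typing import Dict, List, Optional

def summarize_citations(counts: List[int]) -> Optional[Dict[str, float]]:
    """Hypothetical helper: basic summary of citation counts, or None if empty."""
    if not counts:
        return None
    return {"n": len(counts), "mean": mean(counts), "median": median(counts)}

# Illustrative numbers only: one outlier pulls the mean far above the median
sample_counts = [0, 2, 4, 100]
summary = summarize_citations(sample_counts)
empty_summary = summarize_citations([])
print(summary)
print(empty_summary)
```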

## 11. Next Steps and Resources

### Additional OpenAlex Features to Explore:
- **Concepts**: Hierarchical subject classification system
- **Sources**: Journals, repositories, and other publication venues  
- **Publishers**: Academic publishers and their metadata
- **Funders**: Research funding organizations and grant information

### Best Practices:
1. **Rate Limiting**: Always respect API rate limits; this notebook stays conservative at roughly 1 request per second
2. **Email Registration**: Add your email (the `mailto` parameter or the User-Agent header) to join the polite pool
3. **Caching**: Cache results for repeated analysis
4. **Batch Processing**: Use cursor pagination for large datasets
5. **Error Handling**: Implement robust error handling for production use
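
As one possible take on the error-handling point, the sketch below retries a call with exponential backoff. It assumes the wrapped function returns `None` on failure, as `make_openalex_request` does in this notebook; `retry_with_backoff` and its parameters are hypothetical, and the flaky function only simulates two failed attempts.

```python
import time
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")

def retry_with_backoff(func: Callable[[], Optional[T]],
                       max_attempts: int = 4,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Hypothetical helper: call func until it returns non-None, doubling the wait each time."""
    for attempt in range(1, max_attempts + 1):
        result = func()
        if result is not None:
            return result
        if attempt < max_attempts:
            # Waits of base_delay, 2x, 4x, ... between attempts
            sleep(base_delay * 2 ** (attempt - 1))
    return None

# Simulated flaky call: fails twice, then succeeds
attempts = {"count": 0}
def flaky_call() -> Optional[dict]:
    attempts["count"] += 1
    return {"ok": True} if attempts["count"] >= 3 else None

# Record the delays instead of actually sleeping so the demo runs instantly
recorded_delays: List[float] = []
outcome = retry_with_backoff(flaky_call, sleep=recorded_delays.append)
print(outcome, recorded_delays)
```

In the notebook this could wrap a request as `retry_with_backoff(lambda: make_openalex_request("works", params))`.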

### Useful Resources:
- [OpenAlex Documentation](https://docs.openalex.org/)
- [API Overview](https://docs.openalex.org/how-to-use-the-api/api-overview)
- [Filter Documentation](https://docs.openalex.org/how-to-use-the-api/get-lists-of-entities/filter-entity-lists)
- [OpenAlex Community](https://groups.google.com/g/openalex-users)

### Integration with Knowledge Fabric:
This data can be directly integrated into your Knowledge Fabric system using the ingestion pipeline in `src/knowledge_fabric/ingestion/`.

## 12. Finding OpenAlex IDs

Here are practical examples of how to discover OpenAlex IDs for different entities: