# German University Patent Analysis

This notebook analyzes patent data from German universities extracted from the EPO DeepTechFinder tool.
The data includes EP patents from all German universities, with additional information about their spin-offs and startups.

## Data Structure
- **Source**: EPO DeepTechFinder (Manual export, May 2025)
- **Scope**: All German universities with EP patents
- **Additional data**: Dealroom spin-off/startup information where available

## Objectives
1. Import and structure patent data for statistical analysis
2. Prepare data for future OPS/PATSTAT integration
3. Create clean DataFrames for advanced analytics
4. Extract patent identifiers for further enrichment

In [12]:
import pandas as pd
import numpy as np
import re
from urllib.parse import urlparse, parse_qs
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', 50)

## 1. Data Import and Initial Exploration

**Objective**: Load the DeepTechFinder export and understand the data structure.

The CSV contains one row per patent application, with university metadata repeated for each patent. Key fields include:
- University information (name, students, spin-outs)
- Patent details (title, EP number, filing year, status, tech field)
- Espacenet links for patent retrieval
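
Because university metadata is repeated on every patent row, it can be worth confirming that the repeated values really are constant within each university before aggregating them with `first`. The sketch below uses a tiny made-up frame with the same column names; the values, including the deliberate inconsistency for `Uni B`, are illustrative only.

```python
import pandas as pd

# Toy rows mimicking the export layout (values are illustrative, not real data)
toy = pd.DataFrame({
    'University': ['Uni A', 'Uni A', 'Uni B', 'Uni B'],
    'Total_students': [1000, 1000, 500, 600],  # 'Uni B' is deliberately inconsistent
    'Application_title': ['T1', 'T2', 'T3', 'T4'],
})

# Number of distinct metadata values per university; anything above 1 is suspicious
distinct_counts = toy.groupby('University')['Total_students'].nunique()
inconsistent = distinct_counts[distinct_counts > 1].index.tolist()
print(inconsistent)
```

An empty list would indicate that taking the first value per university loses nothing.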

In [13]:
# Load the DeepTechFinder data
df_raw = pd.read_csv('EPO_DeepTechFinder_20250513_DE_Uni_Top100.csv', encoding='mac_roman')

print(f"Dataset shape: {df_raw.shape}")
print(f"\nColumns: {list(df_raw.columns)}")
print(f"\nFirst few rows:")
df_raw.head()

Dataset shape: (11118, 10)

Columns: ['University', 'Total_number_of_Spin_outs', 'Spin_outs_List', 'Total_students', 'Total_number_of_applications', 'Application_title', 'Espacenet_link', 'Filing_year', 'Patent_status', 'Technical_field']

First few rows:


Unnamed: 0,University,Total_number_of_Spin_outs,Spin_outs_List,Total_students,Total_number_of_applications,Application_title,Espacenet_link,Filing_year,Patent_status,Technical_field
0,Karlsruhe Institute of Technology,6,"Robodev GmbH, Renumics, Kamedi, emmtrix Techno...",21848,1269,Spring Member.,https://worldwide.espacenet.com/patent/search?...,1/22/80,EP granted,Other
1,Karlsruhe Institute of Technology,6,"Robodev GmbH, Renumics, Kamedi, emmtrix Techno...",21848,1269,Method And Device For Automatically Focusing A...,https://worldwide.espacenet.com/patent/search?...,2/18/80,EP granted,Other
2,Karlsruhe Institute of Technology,6,"Robodev GmbH, Renumics, Kamedi, emmtrix Techno...",21848,1269,Process For Manufacturing Slot-Nozzle Elements...,https://worldwide.espacenet.com/patent/search?...,5/10/80,EP granted,Other
3,Karlsruhe Institute of Technology,6,"Robodev GmbH, Renumics, Kamedi, emmtrix Techno...",21848,1269,Dosimeter For The Detection Of Radon And/Or Th...,https://worldwide.espacenet.com/patent/search?...,5/28/80,EP granted,Other
4,Karlsruhe Institute of Technology,6,"Robodev GmbH, Renumics, Kamedi, emmtrix Techno...",21848,1269,Container Provided With A Lid.,https://worldwide.espacenet.com/patent/search?...,6/26/80,Refused / Withdrawn,Other


In [14]:
# Basic data exploration
print("=== BASIC STATISTICS ===")
print(f"Total patents: {len(df_raw)}")
print(f"Unique universities: {df_raw['University'].nunique()}")
# Filing_year is still a raw string here, so min()/max() compare text, not dates
print(f"Date range: {df_raw['Filing_year'].min()} - {df_raw['Filing_year'].max()}")
print(f"\nUnique technical fields: {df_raw['Technical_field'].nunique()}")
print(f"\nPatent status distribution:")
print(df_raw['Patent_status'].value_counts())

=== BASIC STATISTICS ===
Total patents: 11118
Unique universities: 100
Date range: 1/10/02 - 9/9/93

Unique technical fields: 640

Patent status distribution:
Patent_status
EP granted              4907
Refused / Withdrawn     4295
Pending                 1626
EP granted & Unitary     290
Name: count, dtype: int64
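
The date range printed above (`1/10/02 - 9/9/93`) is not chronological because `Filing_year` is still a string column at this point, so `min()` and `max()` compare character by character. A minimal sketch with four values taken from the outputs above, assuming the export uses a month/day/two-digit-year layout:

```python
from datetime import datetime

raw_dates = ['1/22/80', '5/28/80', '1/10/02', '9/9/93']

# String comparison: '1/1...' sorts before '1/2...', and anything starting with '9' sorts last
lexicographic_range = (min(raw_dates), max(raw_dates))

# Parse first, then compare; '%y' is the two-digit year directive
parsed = [datetime.strptime(s, '%m/%d/%y') for s in raw_dates]
chronological_range = (min(parsed).year, max(parsed).year)

print(lexicographic_range)
print(chronological_range)
```

Parsing before comparing is what the cleaning step in the next section does for the full column.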


## 2. Data Cleaning and Preparation

**Objective**: Extract and standardize key data fields for analysis.

This section performs three critical data cleaning tasks:
1. **EP Number Extraction**: Parse patent numbers from Espacenet URLs for OPS/PATSTAT queries
2. **Filing Date Parsing**: Convert various date formats to standardized years
3. **Spin-out Processing**: Parse comma-separated company lists into structured data

In [None]:
# Extract EP patent numbers from Espacenet links
def extract_ep_number(espacenet_url):
    """
    Extract EP patent number from Espacenet URL
    
    Example: 'https://worldwide.espacenet.com/patent/search?q=EP80100298A' 
    Returns: 'EP80100298A'
    """
    if pd.isna(espacenet_url):
        return None
    
    # Parse URL and extract query parameter
    try:
        parsed = urlparse(espacenet_url)
        query_params = parse_qs(parsed.query)
        if 'q' in query_params:
            return query_params['q'][0]
    except (ValueError, TypeError, AttributeError):
        pass
    
    return None

# Work on a copy so df_raw stays untouched
df = df_raw.copy()

# Extract EP numbers
df['EP_Number'] = df['Espacenet_link'].apply(extract_ep_number)

print(f"Successfully extracted {df['EP_Number'].notna().sum()} EP numbers")
print(f"Sample EP numbers: {df['EP_Number'].dropna().head().tolist()}")
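
The `q` parameter is returned as-is, so a malformed link would pass through silently. A minimal validation sketch, assuming identifiers follow the `EP` + digits + optional kind code shape seen in the docstring example (`EP80100298A`). The helper name `split_ep_number` and the pattern are illustrative assumptions, not an official EPO format definition.

```python
import re
from urllib.parse import urlparse, parse_qs

# Assumed shape: 'EP', a run of digits, then an optional kind code such as 'A' or 'B1'
EP_PATTERN = re.compile(r'^EP(\d+)([A-Z]\d?)?$')

def split_ep_number(url):
    """Return (digits, kind) for a matching Espacenet link, else None."""
    values = parse_qs(urlparse(url).query).get('q', [])
    if not values:
        return None
    match = EP_PATTERN.match(values[0].strip())
    if match is None:
        return None
    return match.group(1), match.group(2)

print(split_ep_number('https://worldwide.espacenet.com/patent/search?q=EP80100298A'))
print(split_ep_number('https://worldwide.espacenet.com/patent/search?q=spring+member'))
```

Rows where the helper returns `None` could be reviewed by hand before any API lookups.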

In [None]:
# Clean and standardize filing year
def parse_filing_date(date_str):
    """
    Parse filing date and extract year from various formats
    
    Handles formats like: '1/22/80', '2023-05-15', etc.
    Returns: Integer year (e.g., 1980, 2023)
    """
    if pd.isna(date_str):
        return None
    
    # Handle different date formats
    try:
        # Try to parse as datetime first
        date_obj = pd.to_datetime(date_str, errors='coerce')
        if pd.notna(date_obj):
            return date_obj.year
        
        # If that fails, try to extract year with regex
        year_match = re.search(r'(19|20)\d{2}', str(date_str))
        if year_match:
            return int(year_match.group())
    except (ValueError, TypeError, OverflowError):
        pass
    
    return None

# Parse filing years
df['Filing_Year_Clean'] = df['Filing_year'].apply(parse_filing_date)

print(f"Filing year range: {df['Filing_Year_Clean'].min()} - {df['Filing_Year_Clean'].max()}")
print(f"Missing filing years: {df['Filing_Year_Clean'].isna().sum()}")
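
Two-digit years are ambiguous. Python's `%y` directive maps 69-99 to 1969-1999 and 00-68 to 2000-2068, and `pd.to_datetime` may apply its own pivot. The sketch below is a hedged guard, assuming no filing can be later than the export year (2025, per the Data Structure section). `year_from_short_date` and `max_year` are hypothetical names, and `'3/1/30'` is a made-up input.

```python
from datetime import datetime

def year_from_short_date(date_str, max_year=2025):
    """Parse 'm/d/yy' and shift implausible future years back one century."""
    year = datetime.strptime(date_str, '%m/%d/%y').year
    if year > max_year:
        year -= 100  # e.g. 2030 is read as 1930
    return year

print(year_from_short_date('1/22/80'))
print(year_from_short_date('3/1/30'))
```

With filings starting in 1980 the guard may never trigger on this export, but it makes the assumption explicit.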

In [None]:
# Process spin-out information
def parse_spinouts(spinout_str):
    """
    Parse spin-out company list from comma-separated string
    
    Example: 'Robodev GmbH, Renumics, Kamedi' 
    Returns: ['Robodev GmbH', 'Renumics', 'Kamedi']
    """
    if pd.isna(spinout_str) or spinout_str == '':
        return []
    
    # Split by comma and clean up
    spinouts = [s.strip() for s in str(spinout_str).split(',')]
    return [s for s in spinouts if s]

df['Spin_outs_List_Clean'] = df['Spin_outs_List'].apply(parse_spinouts)
df['Num_Spin_outs'] = df['Spin_outs_List_Clean'].apply(len)

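# Spin-out data repeats on every patent row, so count once per university
spinouts_per_uni = df.drop_duplicates('University')['Num_Spin_outs']
print(f"Universities with spin-outs: {(spinouts_per_uni > 0).sum()}")
print(f"Total spin-outs: {spinouts_per_uni.sum()}")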
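
Since the same spin-out list is repeated on every patent row of a university, a long-format table of unique (university, spin-out) pairs is often easier to join against Dealroom-style company data. A minimal sketch with made-up rows, assuming company names contain no embedded commas:

```python
import pandas as pd

# Made-up rows; 'Uni A' repeats its list on each patent row, 'Uni B' has no spin-outs
toy = pd.DataFrame({
    'University': ['Uni A', 'Uni A', 'Uni B'],
    'Spin_outs_List': ['Alpha GmbH, Beta', 'Alpha GmbH, Beta', None],
})

pairs = (
    toy.drop_duplicates('University')
       .assign(Spin_out=lambda d: d['Spin_outs_List'].fillna('').str.split(','))
       .explode('Spin_out')
)
pairs['Spin_out'] = pairs['Spin_out'].str.strip()
pairs = pairs.loc[pairs['Spin_out'] != '', ['University', 'Spin_out']].reset_index(drop=True)
print(pairs)
```

Each spin-out then appears exactly once, regardless of how many patents its university holds.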

## 3. University-Level Analysis DataFrame

**Objective**: Create university-level summary statistics for institutional analysis.

This aggregation creates one row per university with:
- **Patent portfolio metrics**: Total applications, grant rates, technology diversity
- **Performance indicators**: Patents per student, filing activity spans
- **Innovation ecosystem**: Spin-out companies and startup connections

This dataset is ideal for:
- University ranking and benchmarking
- Policy analysis and funding decisions
- Innovation ecosystem studies
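
A minimal sketch of this kind of per-university roll-up using pandas named aggregation, which avoids flattening column names by hand. The toy rows are made up, and counting `EP granted & Unitary` as granted is an assumption about the intended metric rather than something the export documents.

```python
import pandas as pd

toy = pd.DataFrame({
    'University': ['Uni A', 'Uni A', 'Uni A', 'Uni B'],
    'Total_students': [1000, 1000, 1000, 500],
    'Patent_status': ['EP granted', 'EP granted & Unitary', 'Pending', 'Refused / Withdrawn'],
})

# Assumption: both granted statuses count towards the grant rate
toy['Is_Granted_Any'] = toy['Patent_status'].isin(['EP granted', 'EP granted & Unitary'])

# Named aggregation: output column name = (input column, aggregation)
summary = toy.groupby('University').agg(
    Total_Students=('Total_students', 'first'),
    Patents_in_Dataset=('Patent_status', 'size'),
    Granted_Patents=('Is_Granted_Any', 'sum'),
)
summary['Grant_Rate'] = (summary['Granted_Patents'] / summary['Patents_in_Dataset'] * 100).round(1)
print(summary)
```

Whether unitary grants belong in the numerator is a judgment call; the point is that the choice becomes visible in one line.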

In [18]:
# Create university-level summary
university_stats = df.groupby('University').agg({
    'Total_number_of_applications': 'first',
    'Total_students': 'first',
    'Total_number_of_Spin_outs': 'first',
    'Num_Spin_outs': 'first',
    'EP_Number': 'count',  # Count of patents in dataset
    'Filing_Year_Clean': ['min', 'max', 'mean'],
    'Patent_status': lambda x: (x == 'EP granted').sum(),  # Counts 'EP granted' only; 'EP granted & Unitary' rows are excluded
    'Technical_field': lambda x: x.nunique()  # Unique tech fields
}).round(1)

# Flatten column names
university_stats.columns = [
    'Total_Applications_DTF', 'Total_Students', 'Total_Spin_outs_DTF', 
    'Num_Spin_outs_Parsed', 'Patents_in_Dataset', 'First_Filing_Year', 
    'Last_Filing_Year', 'Avg_Filing_Year', 'Granted_Patents', 'Unique_Tech_Fields'
]

# Calculate additional metrics
university_stats['Patents_per_1000_Students'] = (
    university_stats['Total_Applications_DTF'] / university_stats['Total_Students'] * 1000
).round(2)

university_stats['Grant_Rate'] = (
    university_stats['Granted_Patents'] / university_stats['Patents_in_Dataset'] * 100
).round(1)

university_stats['Patent_Span_Years'] = (
    university_stats['Last_Filing_Year'] - university_stats['First_Filing_Year']
)

# Reset index and sort by total applications
university_stats = university_stats.reset_index().sort_values(
    'Total_Applications_DTF', ascending=False
)

print(f"University summary created with {len(university_stats)} universities")
print(f"\nTop 10 universities by patent applications:")
university_stats[['University', 'Total_Applications_DTF', 'Patents_per_1000_Students', 
                  'Grant_Rate', 'Unique_Tech_Fields']].head(10)

University summary created with 100 universities

Top 10 universities by patent applications:


Unnamed: 0,University,Total_Applications_DTF,Patents_per_1000_Students,Grant_Rate,Unique_Tech_Fields
34,Karlsruhe Institute of Technology,1269,58.08,63.9,72
60,Technical University of Munich,647,13.75,32.1,89
79,University of Erlangen-Nürnberg,553,14.66,38.0,69
80,University of Freiburg,537,22.43,44.9,73
65,Technische Universität Dresden,492,16.8,53.9,60
25,Heidelberg University,420,14.76,39.5,96
58,Technical University of Berlin,388,11.04,53.4,53
0,Aachen University,351,7.39,39.0,40
31,Johannes Gutenberg University Mainz,333,10.93,44.4,110
96,University of Stuttgart,307,13.57,43.0,34
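
As a quick sanity check on the per-student metric, the Karlsruhe row can be reproduced by hand from the figures shown in the raw export (1,269 applications, 21,848 students). Note that the numerator is the DeepTechFinder total, not the number of rows in this dataset.

```python
applications = 1269  # Total_number_of_applications for Karlsruhe Institute of Technology
students = 21848     # Total_students for Karlsruhe Institute of Technology

# Same formula as Patents_per_1000_Students in the cell above
patents_per_1000_students = round(applications / students * 1000, 2)
print(patents_per_1000_students)
```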


## 4. Patent-Level DataFrame for Technology Analysis

**Objective**: Create a clean patent-level dataset optimized for technology field analysis.

This dataset provides:
- **Clean patent identifiers**: EP numbers ready for OPS/PATSTAT queries
- **Technology mapping**: Patents linked to classification fields
- **University context**: Each patent tagged with applicant university
- **Spin-out connections**: Each patent tagged with the first spin-out listed for its university, where available (a university-level attribute, not a patent-level link)

This structure enables:
- Technology trend analysis by university
- Patent landscape studies
- Innovation transfer research (university → spin-out connections)

In [23]:
# Create simplified patent-level DataFrame for technology analysis
patents_clean = df[['University', 'EP_Number', 'Filing_Year_Clean', 'Patent_status', 'Technical_field']].copy()

# Add spin-out information: the first spin-out listed for the patent's university (not a patent-level link)
def get_primary_spinout(spinout_list):
    """Get the first spin-out company from the list, or None if empty"""
    return spinout_list[0] if spinout_list else None

patents_clean['Primary_Spinout'] = df['Spin_outs_List_Clean'].apply(get_primary_spinout)

# Add university metadata (student numbers for context)
university_meta = df.groupby('University')[['Total_students', 'Num_Spin_outs']].first().reset_index()
patents_clean = patents_clean.merge(university_meta, on='University', how='left')

# Clean EP numbers for OPS/PATSTAT queries (literal replacements, then keep the digit run)
patents_clean['EP_Number_Clean'] = (
    patents_clean['EP_Number']
    .str.replace('EP', '', regex=False)
    .str.replace('A', '', regex=False)
)
patents_clean['EP_Application_ID'] = patents_clean['EP_Number_Clean'].str.extract(r'(\d+)', expand=False)

# Add analysis flags
# Is_Granted matches 'EP granted' exactly, so 'EP granted & Unitary' rows are not flagged
patents_clean['Is_Granted'] = patents_clean['Patent_status'] == 'EP granted'
patents_clean['Is_Recent'] = patents_clean['Filing_Year_Clean'] >= 2020
patents_clean['Has_Spinout_Connection'] = patents_clean['Primary_Spinout'].notna()

print(f"Patent-level DataFrame created with {len(patents_clean)} records")
print(f"Patents with spin-out connections: {patents_clean['Has_Spinout_Connection'].sum()}")
print(f"Ready for PATSTAT queries: {patents_clean['EP_Application_ID'].notna().sum()} patents")
print(f"\nSample patent records:")
patents_clean[['University', 'EP_Number', 'Filing_Year_Clean', 'Technical_field', 
               'Primary_Spinout']].head()

Patent-level DataFrame created with 11118 records
Patents with spin-out connections: 5571
Ready for PATSTAT queries: 11118 patents

Sample patent records:


Unnamed: 0,University,EP_Number,Filing_Year_Clean,Technical_field,Primary_Spinout
0,Karlsruhe Institute of Technology,EP80100298A,1980,Other,Robodev GmbH
1,Karlsruhe Institute of Technology,EP80100797A,1980,Other,Robodev GmbH
2,Karlsruhe Institute of Technology,EP80102603A,1980,Other,Robodev GmbH
3,Karlsruhe Institute of Technology,EP80102962A,1980,Other,Robodev GmbH
4,Karlsruhe Institute of Technology,EP80103623A,1980,Other,Robodev GmbH


## 5. Technology Field Analysis

**Objective**: Analyze patent activity across different technology domains.

This analysis provides insights into:
- **Technology popularity**: Which fields are most active in German university research
- **University diversity**: How many institutions work in each technology area
- **Success rates**: Grant rates by technology
- **Innovation transfer**: Share of patents in each field that come from universities with at least one listed spin-out

Key metrics calculated:
- Patent counts and university participation per field
- Grant rates and temporal trends
- Share of patents from universities with listed spin-outs, by technology area
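
Some `Technical_field` values appear to bundle several labels in one comma-separated string, which would inflate the number of unique fields. A minimal sketch of counting individual labels instead, assuming a comma always separates labels and never occurs inside one. The sample strings are made up.

```python
from collections import Counter

fields = [
    'Oncology - treatment: Immunotherapy',
    'Oncology - treatment: Gene therapy,Oncology - treatment: Immunotherapy',
    'Other',
]

# Split each value into its labels and count every label once per patent row
label_counts = Counter(
    label.strip()
    for value in fields
    for label in value.split(',')
)
print(label_counts.most_common())
```

Counting labels rather than raw strings would give a field ranking that is less sensitive to how combinations happen to be written.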

In [24]:
# Technology field analysis using the clean patent dataset
tech_analysis = patents_clean.groupby('Technical_field').agg({
    'EP_Number': 'count',
    'University': 'nunique',
    'Is_Granted': 'sum',
    'Filing_Year_Clean': ['min', 'max', 'mean'],
    'Has_Spinout_Connection': 'sum'
}).round(1)

# Flatten column names for clarity
tech_analysis.columns = [
    'Patent_Count', 'University_Count', 'Granted_Count', 
    'First_Year', 'Last_Year', 'Avg_Year', 'With_Spinouts'
]

# Calculate performance rates
tech_analysis['Grant_Rate'] = (
    tech_analysis['Granted_Count'] / tech_analysis['Patent_Count'] * 100
).round(1)

tech_analysis['Spinout_Rate'] = (
    tech_analysis['With_Spinouts'] / tech_analysis['Patent_Count'] * 100
).round(1)

tech_analysis['Activity_Span'] = tech_analysis['Last_Year'] - tech_analysis['First_Year']

# Sort by patent count to identify most active fields
tech_analysis = tech_analysis.sort_values('Patent_Count', ascending=False)

print("Technology field analysis (top 10 most active fields):")
print(tech_analysis[['Patent_Count', 'University_Count', 'Grant_Rate', 'Spinout_Rate']].head(10))

Technology field analysis (top 10 most active fields):
                                                    Patent_Count
Technical_field
Other                                                       8498
Oncology - treatment: Immunotherapy                          153
Oncology - cancer models: Cancer models                      144
Oncology - treatment: Targeted chemotherapy                  125
Oncology - diagnostics: Biopsies                             122
Oncology - diagnostics: Imaging                               72
Clean energy - storage: Lithium and lithium-ion               66
Smart industry - enabling: User interface                     64
Oncology - treatment: Gene therapy,Oncology - t...            55
Oncology - diagnostics: Personalised medicine                 53

## 6. Data Export for Further Analysis

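**Objective**: Create two clean, focused datasets for downstream analysis, plus two small reference files.

**Export Strategy**:
1. **University Summary** (`university_summary.csv`): One row per university with all institutional metrics
2. **Patent Technology List** (`patent_technology_list.csv`): One row per patent with university, technology field, and the university's first listed spin-out
3. **Technology Field Summary** (`technology_field_summary.csv`): One row per technical field with counts, rates, and activity span
4. **EP Numbers** (`ep_numbers_for_ops.csv`): Unique EP numbers for OPS queries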

This approach provides:
- **Focused datasets**: Each file serves a specific analytical purpose
- **Reduced complexity**: Eliminates redundant information and oversized files
- **Clear relationships**: Universities ↔ Patents ↔ Technologies ↔ Spin-outs
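
One practical caveat when reloading these files: identifier columns made only of digits are parsed as integers by default, which drops leading zeros. A minimal sketch, assuming the `EP_App_ID_for_PATSTAT` column name used in the export cell; the single row is made up.

```python
import io
import pandas as pd

# One made-up row standing in for patent_technology_list.csv
csv_text = "EP_Patent_Number,EP_App_ID_for_PATSTAT\nEP00100298A,00100298\n"

as_default = pd.read_csv(io.StringIO(csv_text))
as_text = pd.read_csv(io.StringIO(csv_text), dtype={'EP_App_ID_for_PATSTAT': str})

print(as_default['EP_App_ID_for_PATSTAT'].iloc[0])  # parsed as an integer, leading zeros lost
print(as_text['EP_App_ID_for_PATSTAT'].iloc[0])     # kept as the original string
```

Passing an explicit `dtype` keeps identifiers comparable with string keys in other sources.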

In [25]:
import os

# Create output directory if it doesn't exist
output_dir = './output'
os.makedirs(output_dir, exist_ok=True)

# 1. University Summary Dataset - One row per university
university_summary = university_stats[['University', 'Total_Applications_DTF', 'Total_Students', 
                                     'Total_Spin_outs_DTF', 'Patents_in_Dataset', 'First_Filing_Year',
                                     'Last_Filing_Year', 'Avg_Filing_Year', 'Granted_Patents', 
                                     'Unique_Tech_Fields', 'Patents_per_1000_Students', 
                                     'Grant_Rate', 'Patent_Span_Years']].copy()

university_summary.to_csv(os.path.join(output_dir, 'university_summary.csv'), index=False)

# 2. Patent Technology List - Clean patent dataset with key information
patent_tech_list = patents_clean[['EP_Number', 'University', 'Technical_field', 
                                'Filing_Year_Clean', 'Patent_status', 'Primary_Spinout',
                                'EP_Application_ID']].copy()

# Rename columns for clarity
patent_tech_list = patent_tech_list.rename(columns={
    'EP_Number': 'EP_Patent_Number',
    'Filing_Year_Clean': 'Filing_Year',
    'Primary_Spinout': 'Connected_Spinout',
    'EP_Application_ID': 'EP_App_ID_for_PATSTAT'
})

patent_tech_list.to_csv(os.path.join(output_dir, 'patent_technology_list.csv'), index=False)

# 3. Technology Field Summary for reference
tech_summary = tech_analysis[['Patent_Count', 'University_Count', 'Grant_Rate', 
                            'Spinout_Rate', 'Activity_Span']].copy()
tech_summary.to_csv(os.path.join(output_dir, 'technology_field_summary.csv'), index=True)

# 4. EP Numbers Only for OPS queries
ep_numbers_only = patents_clean['EP_Number'].dropna().unique()
pd.DataFrame(ep_numbers_only, columns=['EP_Number']).to_csv(
    os.path.join(output_dir, 'ep_numbers_for_ops.csv'), index=False)

# Remove the old large file if it exists
old_file = os.path.join(output_dir, 'patents_for_ops_patstat.csv')
if os.path.exists(old_file):
    os.remove(old_file)
    print(f"Removed oversized file: {old_file}")

print(f"Exported clean datasets to {output_dir}/:")
print(f"- university_summary.csv: {len(university_summary)} universities with complete metrics")
print(f"- patent_technology_list.csv: {len(patent_tech_list)} patents with technology & spin-out info")
print(f"- technology_field_summary.csv: {len(tech_summary)} technology fields with performance metrics")
print(f"- ep_numbers_for_ops.csv: {len(ep_numbers_only)} unique EP numbers for API queries")

Exported clean datasets to ./output/:
- university_summary.csv: 100 universities with complete metrics
- patent_technology_list.csv: 11118 patents with technology & spin-out info
- technology_field_summary.csv: 640 technology fields with performance metrics
- ep_numbers_for_ops.csv: 11054 unique EP numbers for API queries
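
The gap between 11,118 patent rows and 11,054 unique EP numbers means some numbers occur more than once. One plausible cause is applications listed under several universities, though the export does not confirm this. A minimal sketch for listing such cases, using made-up rows:

```python
from collections import defaultdict

# Made-up (EP number, university) rows; 'EP00000001A' appears under two universities
rows = [
    ('EP00000001A', 'Uni A'),
    ('EP00000001A', 'Uni B'),
    ('EP00000002A', 'Uni A'),
]

universities_by_ep = defaultdict(set)
for ep_number, university in rows:
    universities_by_ep[ep_number].add(university)

# Keep only EP numbers attached to more than one university
shared = {ep: sorted(unis) for ep, unis in universities_by_ep.items() if len(unis) > 1}
print(shared)
```

Repeated numbers that map to a single university would instead point to plain duplicate rows.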


## 7. Summary and Next Steps

**Objective**: Provide analysis summary and guidance for future work.

This notebook converts the DeepTechFinder export into analysis-ready files: university-level metrics, a patent-level technology list, a technology field summary, and a list of unique EP numbers for enrichment through external patent databases.
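
For later enrichment through an external API, the EP number list will usually need to be processed in batches. A minimal, API-agnostic sketch; the helper name `chunked` and the batch size are hypothetical, and actual request limits should be taken from the service's own documentation.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Sample EP numbers from the patent-level output above
ep_numbers = ['EP80100298A', 'EP80100797A', 'EP80102603A', 'EP80102962A', 'EP80103623A']
batches = list(chunked(ep_numbers, size=2))  # size=2 only to keep the demo small
print(batches)
```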

In [26]:
print("=== GERMAN UNIVERSITY PATENT ANALYSIS SUMMARY ===")
print(f"\nDATASET OVERVIEW:")
print(f"- Total patents analyzed: {len(patents_clean):,}")
print(f"- Universities covered: {len(university_summary):,}")
print(f"- Time span: {patents_clean['Filing_Year_Clean'].min():.0f} - {patents_clean['Filing_Year_Clean'].max():.0f}")
print(f"- Technology fields: {len(tech_analysis):,}")

print(f"\nPATENT CHARACTERISTICS:")
print(f"- Granted patents: {patents_clean['Is_Granted'].sum():,} ({patents_clean['Is_Granted'].mean()*100:.1f}%)")
print(f"- Recent patents (2020+): {patents_clean['Is_Recent'].sum():,} ({patents_clean['Is_Recent'].mean()*100:.1f}%)")
print(f"- With spin-out connections: {patents_clean['Has_Spinout_Connection'].sum():,} ({patents_clean['Has_Spinout_Connection'].mean()*100:.1f}%)")

print(f"\nTOP PERFORMERS:")
top_by_patents = university_summary.iloc[0]
top_by_ratio = university_summary.sort_values('Patents_per_1000_Students', ascending=False).iloc[0]
print(f"- Most patents: {top_by_patents['University']} ({top_by_patents['Total_Applications_DTF']:,} patents)")
print(f"- Highest ratio: {top_by_ratio['University']} ({top_by_ratio['Patents_per_1000_Students']:.1f} per 1000 students)")

print(f"\nEXPORTED DATASETS:")
print(f"- university_summary.csv: Institutional metrics for {len(university_summary)} universities")
print(f"- patent_technology_list.csv: Technology mapping for {len(patent_tech_list):,} patents")
print(f"- technology_field_summary.csv: Performance analysis across {len(tech_summary)} fields")
print(f"- ep_numbers_for_ops.csv: {len(ep_numbers_only):,} unique EP numbers for API queries")

print(f"\nRECOMMENDED NEXT STEPS:")
print(f"1. University Analysis: Use './output/university_summary.csv' for institutional benchmarking")
print(f"2. Technology Trends: Analyze './output/patent_technology_list.csv' for field-specific patterns")
print(f"3. Innovation Transfer: Study university-spinout connections in patent data")
print(f"4. Data Enrichment: Use './output/ep_numbers_for_ops.csv' for EPO OPS API queries")
print(f"5. PATSTAT Integration: Use EP_App_ID_for_PATSTAT column for database queries")

print(f"\nKEY INSIGHTS:")
print(f"- Patents span {len(tech_analysis)} distinct Technical_field values, including multi-label combinations")
print(f"- Spin-out coverage: {patents_clean['Has_Spinout_Connection'].mean()*100:.1f}% of patents come from universities with at least one listed spin-out")
print(f"- Recent activity: {patents_clean['Is_Recent'].mean()*100:.1f}% of patents filed since 2020")

=== GERMAN UNIVERSITY PATENT ANALYSIS SUMMARY ===

DATASET OVERVIEW:
- Total patents analyzed: 11,118
- Universities covered: 100
- Time span: 1980 - 2024
- Technology fields: 640

PATENT CHARACTERISTICS:
- Granted patents: 4,907 (44.1%)
- Recent patents (2020+): 2,036 (18.3%)
- With spin-out connections: 5,571 (50.1%)

TOP PERFORMERS:
- Most patents: Karlsruhe Institute of Technology (1,269 patents)
- Highest ratio: Hannover Medical School (61.1 per 1000 students)

EXPORTED DATASETS:
- university_summary.csv: Institutional metrics for 100 universities
- patent_technology_list.csv: Technology mapping for 11,118 patents
- technology_field_summary.csv: Performance analysis across 640 fields
- ep_numbers_for_ops.csv: 11,054 unique EP numbers for API queries

RECOMMENDED NEXT STEPS:
1. University Analysis: Use './output/university_summary.csv' for institutional benchmarking
2. Technology Trends: Analyze './output/patent_technology_list.csv' for field-specific patterns
3. Innovation Transfe