# Mercedes F1 Infringement Document Filter

This notebook extracts Mercedes-specific infringement documents from the FIA PDF collection and converts them to text files for further processing.

## Objective
- Parse all PDF files in the Documents folder
- Filter for documents addressed to "Mercedes-AMG PETRONAS F1 Team"
- Extract text content from filtered PDFs
- Save as individual .txt files organized by year


In [2]:
pip install PyPDF2

Collecting PyPDF2
  Using cached pypdf2-3.0.1-py3-none-any.whl.metadata (6.8 kB)
Using cached pypdf2-3.0.1-py3-none-any.whl (232 kB)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Note: you may need to restart the kernel to use updated packages.


In [14]:
# Import required libraries
import os
import re
from pathlib import Path
import PyPDF2
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')


In [15]:
# Define paths
base_path = Path("Documents")
years = ["2020-infirdgement_profile", "2021-infridgement_profile", 
         "2022-infridgement_profile", "2023-infridgement_profile"]

print("Available year folders:")
for year in years:
    year_path = base_path / year
    if year_path.exists():
        pdf_count = len(list(year_path.glob("*.pdf")))
        print(f"  {year}: {pdf_count} PDF files")


Available year folders:
  2020-infirdgement_profile: 131 PDF files
  2021-infridgement_profile: 175 PDF files
  2022-infridgement_profile: 227 PDF files
  2023-infridgement_profile: 194 PDF files
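
The folder names above are hard-coded and spelled inconsistently (`infirdgement` versus `infridgement`), so a renamed folder would be skipped silently. A minimal sketch of discovering the year folders instead, assuming every relevant folder name starts with a four-digit year and a hyphen; `find_year_folders` is a hypothetical helper, not part of the notebook:

```python
import re
import tempfile
from pathlib import Path

def find_year_folders(base: Path) -> list[str]:
    """Return the names of sub-folders that start with a four-digit year and a hyphen."""
    return sorted(
        p.name for p in base.iterdir()
        if p.is_dir() and re.match(r"\d{4}-", p.name)
    )

# Demonstration on a throwaway directory that mimics the Documents layout.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    for name in ["2020-infirdgement_profile", "2021-infridgement_profile", "notes"]:
        (base / name).mkdir()
    found = find_year_folders(base)
    print(found)
```

In the notebook this would be called as `find_year_folders(base_path)` in place of the literal `years` list.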


In [16]:
# Function to extract text from PDF
def extract_pdf_text(pdf_path):
    """
    Extract text from a PDF file.
    Returns the concatenated text of all pages, or None if the file cannot be read.
    """
    try:
        with open(pdf_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            # Iterate over pages directly; fall back to "" if a page yields no text
            return "".join((page.extract_text() or "") + "\n"
                           for page in pdf_reader.pages)
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
        return None


In [17]:
# Updated function to check if document is addressed to Mercedes (includes car numbers)
def is_mercedes_document_updated(text):
    """
    Check if the document is addressed to Mercedes team or involves Mercedes drivers
    Look for the specific header pattern and car numbers
    """
    if not text:
        return False
    
    # Convert to lowercase for case-insensitive matching
    text_lower = text.lower()
    
    # Look for Mercedes team references in the header section
    mercedes_patterns = [
        r'to:\s*the team manager,?\s*mercedes[-\s]*amg\s*petronas\s*f1\s*team',
        r'mercedes[-\s]*amg\s*petronas\s*f1\s*team',
        r'mercedes\s*petronas\s*f1\s*team',
        r'mercedes\s*amg\s*petronas'
    ]
    
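    # Mercedes car numbers by season:
    # 2020-2021: Car 44 (Hamilton), Car 77 (Bottas)
    # 2022-2023: Car 44 (Hamilton), Car 63 (Russell)
    # Caveat: numbers follow the driver, not the team. Car 63 ran for Williams
    # for most of 2020-2021 and Car 77 for Alfa Romeo in 2022-2023, so this
    # year-agnostic list can produce false positives.
    mercedes_car_numbers = [44, 77, 63]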
    
    # Check first 2000 characters (header section)
    header_section = text_lower[:2000]
    
    # Check for Mercedes team references
    for pattern in mercedes_patterns:
        if re.search(pattern, header_section, re.IGNORECASE):
            return True
    
    # Check for Mercedes car numbers in document title/header
    # Look for patterns like "Car 44", "Car 77", etc.
    for car_num in mercedes_car_numbers:
        car_pattern = rf'\bcar\s*{car_num}\b'
        if re.search(car_pattern, header_section, re.IGNORECASE):
            return True
    
    return False

# Alias the updated function under the original name
# (this holds only until is_mercedes_document is redefined)
is_mercedes_document = is_mercedes_document_updated
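
Because race numbers belong to drivers rather than teams, a car-number check is only reliable when it is tied to the season. A minimal sketch of a season-aware variant, assuming the season-to-number mapping from the comments in the cell above; `MERCEDES_CARS_BY_YEAR` and `mentions_mercedes_car` are hypothetical names, and the header strings are illustrative only:

```python
import re

# Assumed mapping of season -> Mercedes race numbers (from the comments above).
MERCEDES_CARS_BY_YEAR = {
    2020: {44, 77},
    2021: {44, 77},
    2022: {44, 63},
    2023: {44, 63},
}

def mentions_mercedes_car(header: str, year: int) -> bool:
    """True if the header names a car number that Mercedes ran in the given season."""
    numbers = MERCEDES_CARS_BY_YEAR.get(year, set())
    return any(
        re.search(rf"\bcar\s*{n}\b", header, re.IGNORECASE) for n in numbers
    )

print(mentions_mercedes_car("Offence - Car 63 - Pit lane speeding", 2022))  # True
print(mentions_mercedes_car("Offence - Car 63 - Pit lane speeding", 2020))  # False
print(mentions_mercedes_car("Offence - Car 163 - Pit lane speeding", 2022))  # False
```

One-off driver substitutions, such as the 2020 Sakhir Grand Prix document for Car 63 listed in the results, are not covered by this mapping and would still depend on the team-name patterns.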


In [18]:
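# Function to check if document is addressed to Mercedes (team-name match only)
# Note: this cell runs after the previous one, so this definition replaces the
# car-number-aware alias and is the version process_year_folder calls.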
def is_mercedes_document(text):
    """
    Check if the document is addressed to Mercedes team
    Look for the specific header pattern
    """
    if not text:
        return False
    
    # Convert to lowercase for case-insensitive matching
    text_lower = text.lower()
    
    # Look for Mercedes team references in the header section
    mercedes_patterns = [
        r'to:\s*the team manager,?\s*mercedes[-\s]*amg\s*petronas\s*f1\s*team',
        r'mercedes[-\s]*amg\s*petronas\s*f1\s*team',
        r'mercedes\s*petronas\s*f1\s*team',
        r'mercedes\s*amg\s*petronas'
    ]
    
    # Check first 2000 characters (header section)
    header_section = text_lower[:2000]
    
    for pattern in mercedes_patterns:
        if re.search(pattern, header_section, re.IGNORECASE):
            return True
    
    return False
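
To see which of the four patterns actually fires, they can be run against a header like the one in the sample output at the end of this notebook. This is only a sketch: the header string is copied from the whitespace-normalized sample, and the raw text extracted from a PDF may be laid out differently.

```python
import re

patterns = [
    r'to:\s*the team manager,?\s*mercedes[-\s]*amg\s*petronas\s*f1\s*team',
    r'mercedes[-\s]*amg\s*petronas\s*f1\s*team',
    r'mercedes\s*petronas\s*f1\s*team',
    r'mercedes\s*amg\s*petronas',
]

# Header as it appears in the cleaned sample text, lowercased as in the function.
header = "From The Stewards To The Team Manager, Mercedes-AMG Petronas F1 Team".lower()

hits = [bool(re.search(p, header)) for p in patterns]
print(hits)  # [False, True, False, False]
```

Only the second pattern matches here: the first requires a literal colon after "to", and the last two do not allow the hyphen in "Mercedes-AMG".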


In [19]:
# Function to clean and save text content
def clean_text(text):
    """
    Clean extracted text
    """
    if not text:
        return ""
    
    # Remove excessive whitespace and normalize
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    
    return text
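
A quick illustration of what `clean_text` does to multi-line input. Since `\s+` also matches newlines, the whole document collapses into a single line. The second function, `clean_text_keep_lines`, is a hypothetical alternative (not used in this notebook) that normalizes spacing but keeps line breaks, in case later steps need the line structure:

```python
import re

def clean_text(text):
    """Same behavior as the notebook's clean_text: collapse all whitespace runs."""
    if not text:
        return ""
    return re.sub(r'\s+', ' ', text).strip()

def clean_text_keep_lines(text):
    """Hypothetical variant: collapse spaces and tabs, drop empty lines, keep newlines."""
    if not text:
        return ""
    lines = (re.sub(r'[ \t]+', ' ', ln).strip() for ln in text.splitlines())
    return "\n".join(ln for ln in lines if ln)

raw = "From\nThe Stewards\n\nTo   The Team Manager,\tMercedes-AMG Petronas F1 Team\n"

cleaned = clean_text(raw)
kept = clean_text_keep_lines(raw)
print(cleaned)
print(kept)
```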


In [20]:
# Main processing function
def process_year_folder(year_folder):
    """
    Process all PDFs in a year folder and extract Mercedes documents
    """
    year_path = base_path / year_folder
    if not year_path.exists():
        print(f"Folder {year_path} does not exist")
        return []
    
    pdf_files = list(year_path.glob("*.pdf"))
    print(f"\nProcessing {len(pdf_files)} PDF files in {year_folder}...")
    
    mercedes_docs = []
    
    for pdf_file in tqdm(pdf_files, desc="Processing PDFs"):
        # Extract text from PDF
        text = extract_pdf_text(pdf_file)
        
        if text and is_mercedes_document(text):
            # Clean the text
            cleaned_text = clean_text(text)
            
            # Create output filename
            txt_filename = pdf_file.stem + ".txt"
            txt_path = year_path / txt_filename
            
            # Save as text file
            try:
                with open(txt_path, 'w', encoding='utf-8') as f:
                    f.write(cleaned_text)
                
                mercedes_docs.append({
                    'pdf_file': pdf_file.name,
                    'txt_file': txt_filename,
                    'year': year_folder,
                    'text_length': len(cleaned_text)
                })
                
                print(f"✓ Found Mercedes document: {pdf_file.name}")
                
            except Exception as e:
                print(f"Error saving {txt_filename}: {e}")
    
    return mercedes_docs


In [21]:
# Process all year folders
all_mercedes_docs = []

for year_folder in years:
    mercedes_docs = process_year_folder(year_folder)
    if mercedes_docs:
        all_mercedes_docs.extend(mercedes_docs)
        print(f"Found {len(mercedes_docs)} Mercedes documents in {year_folder}")
    else:
        print(f"No Mercedes documents found in {year_folder}")



Processing 131 PDF files in 2020-infirdgement_profile...

✓ Found Mercedes document: 2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.pdf
✓ Found Mercedes document: 2020 Austrian Grand Prix - Decision - review of decision (document 33).pdf
✓ Found Mercedes document: 2020 Austrian Grand Prix - Offence - Car 44 - Failure to slow for yellow flags (post review).pdf
✓ Found Mercedes document: 2020 Austrian Grand Prix - Offence - Car 44 - incident with car 23.pdf
✓ Found Mercedes document: 2020 Austrian Grand Prix - Offence - Car 44 - Leaving the track in turn 10.pdf
✓ Found Mercedes document: 2020 Austrian Grand Prix - Offence - Car 44 - Track Limits turn 10.pdf
✓ Found Mercedes document: 2020 Italian Grand Prix - Offence - Car 44 - Entering closed pit lane.pdf
✓ Found Mercedes document: 2020 Russian Grand Prix - Decision - Car 44 - Turn 2 .pdf
✓ Found Mercedes document: 2020 Russian Grand Prix - Offence - Car 44 - 1st Practice start.pdf
✓ Found Mercedes document: 2020 Russian Grand Prix - Offence - Car 44 - 2nd Practice start .pdf
✓ Found Mercedes document: 2020 Russian Grand Prix - Replacement for Document 46 - Offence - Car 44 - 1st Practice Start.pdf
✓ Found Mercedes document: 2020 Russian Grand Prix - Replacement for Document 47 - Offence - Car 44 - 2nd Practice Start.pdf
✓ Found Mercedes document: 2020 Sakhir Grand Prix - Offence - Mercedes - Car 63 incorrect use of tyres.pdf

Found 13 Mercedes documents in 2020-infirdgement_profile

Processing 175 PDF files in 2021-infridgement_profile...

✓ Found Mercedes document: 2021 Abu Dhabi Grand Prix - Decision - Mercedes Protest Art. 48.8.pdf
✓ Found Mercedes document: 2021 Austrian Grand Prix - Decision - Car 77 - Alleged driving unnecessarily slowly .pdf
✓ Found Mercedes document: 2021 Brazilian Grand Prix - Offence - Car 44 - DRS.pdf
✓ Found Mercedes document: 2021 Brazilian Grand Prix - Offence - Car 44 - PU element.pdf
✓ Found Mercedes document: 2021 Brazilian Grand Prix - Offence - Car 44 - Safety Belts.pdf
✓ Found Mercedes document: 2021 British Grand Prix - Offence - Car 44 - Causing a collision with car 33.pdf
✓ Found Mercedes document: 2021 Hungarian Grand Prix - Offence - Car 77 - causing a collision.pdf
✓ Found Mercedes document: 2021 Hungarian Grand Prix - Offence - Car 77 - Pre-Race procedure.pdf
✓ Found Mercedes document: 2021 Italian Grand Prix - Offence - Car 77 - PU element.pdf
✓ Found Mercedes document: 2021 Italian Grand Prix - Offence - Car 77 - PU elements.pdf
✓ Found Mercedes document: 2021 Mexican Grand Prix - Offence - Car 44 - Turn 2 .pdf
✓ Found Mercedes document: 2021 Qatar Grand Prix - Offence - Car 77 - Single waved yellow flag.pdf
✓ Found Mercedes document: 2021 Russian Grand Prix - Offence - Car 77 - PU elements.pdf
✓ Found Mercedes document: 2021 Saudi Arabian Grand Prix - Decision - Car 44 - double yellow.pdf
✓ Found Mercedes document: 2021 Saudi Arabian Grand Prix - Offence - Car 44 - Impeding.pdf
✓ Found Mercedes document: 2021 Turkish Grand Prix - Offence - Car 44 - PU element.pdf
✓ Found Mercedes document: 2021 United States Grand Prix - Offence - Car 77 - PU elements.pdf

Found 17 Mercedes documents in 2021-infridgement_profile

Processing 227 PDF files in 2022-infridgement_profile...

✓ Found Mercedes document: 2022 Abu Dhabi Grand Prix - Decision - Car 44 - Red Flag_0.pdf
✓ Found Mercedes document: 2022 Abu Dhabi Grand Prix - Offence - Car 44 - Pit lane speeding.pdf
✓ Found Mercedes document: 2022 Abu Dhabi Grand Prix - Offence - Car 63 - Unsafe release.pdf
✓ Found Mercedes document: 2022 Australian Grand Prix - Decision - Car 44 - Alleged impeding of Car 18 at turn 13.pdf
✓ Found Mercedes document: 2022 Austrian Grand Prix - Decision - Team Radio Communication - Formation Lap.pdf
✓ Found Mercedes document: 2022 Austrian Grand Prix - Offence - Car 44 - Parc Ferme Instructions.pdf
✓ Found Mercedes document: 2022 Austrian Grand Prix - Offence - Car 63 - Causing a collision.pdf
✓ Found Mercedes document: 2022 Austrian Grand Prix - Offence - Car 63 - Entered the track on foot.pdf
✓ Found Mercedes document: 2022 Azerbaijan Grand Prix - Decision - Car 44 - Allegedly driving unnecessarily slowly during Qualifying.pdf
✓ Found Mercedes document: 2022 Belgian Grand Prix - Offence - Car 44 - Alleged causing a collision.pdf
✓ Found Mercedes document: 2022 Belgian Grand Prix - Offence - Car 44 - Refusal to visit Medical Centre.pdf
✓ Found Mercedes document: 2022 Dutch Grand Prix - Offence - Car 44 - Alleged impeding of Car 55.pdf
✓ Found Mercedes document: 2022 Italian Grand Prix - Offence - Car 44 - PU element.pdf
✓ Found Mercedes document: 2022 Singapore Grand Prix - Decision - Car 44 - Breach of Appendix L.pdf
✓ Found Mercedes document: 2022 Singapore Grand Prix - Offence - Car 63 - Pit lane speeding.pdf
✓ Found Mercedes document: 2022 Singapore Grand Prix - Offence - Car 63 - PU elements.pdf
✓ Found Mercedes document: 2022 United States Grand Prix - Offence - Car 63 - T1 Incident with car 55.pdf

Found 17 Mercedes documents in 2022-infridgement_profile

Processing 194 PDF files in 2023-infridgement_profile...

✓ Found Mercedes document: 2023 Abu Dhabi Grand Prix - Infringement - Mercedes - Team Principal (Updated).pdf
✓ Found Mercedes document: 2023 Australian Grand Prix - Decision - Mercedes - Inaccurate Self Scrutineering Form.pdf
✓ Found Mercedes document: 2023 Austrian Grand Prix - Infringement - Car 44 - Leaving the track multiple times.pdf
✓ Found Mercedes document: 2023 Austrian Grand Prix - Infringement - Car 44 - Pit Lane Speeding.pdf
✓ Found Mercedes document: 2023 Bahrain Grand Prix - Decision - Car 44 - Wearing of Jewellery.pdf
✓ Found Mercedes document: 2023 Belgian Grand Prix - Infringement - Car 44 - Causing a Collision.pdf
✓ Found Mercedes document: 2023 British Grand Prix - Infringement - Mercedes - Thursday Press Conference.pdf
✓ Found Mercedes document: 2023 Canadian Grand Prix - Decision - Car 44 - Alleged Unsafe Release.pdf
✓ Found Mercedes document: 2023 Italian Grand Prix - Infringement - Car 44 - Causing a Collision.pdf
✓ Found Mercedes document: 2023 Italian Grand Prix - Infringement - Car 63 - Leaving the track.pdf
✓ Found Mercedes document: 2023 Las Vegas Grand Prix - Infringement - Car 63 - Causing a collision.pdf
✓ Found Mercedes document: 2023 Miami Grand Prix - Decision - Car 44 - Turn 17 Incident.pdf
✓ Found Mercedes document: 2023 Monaco Grand Prix - Infringement - Car 44 - Pit Lane Speeding.pdf
✓ Found Mercedes document: 2023 Monaco Grand Prix - Infringement - Car 63 - Pit Lane Speeding.pdf
✓ Found Mercedes document: 2023 Monaco Grand Prix - Infringement - Car 63 - Unsafe Rejoin.pdf
✓ Found Mercedes document: 2023 Qatar Grand Prix - Infringement - Car 44 - Crossing the track.pdf
✓ Found Mercedes document: 2023 Saudi Arabian Grand Prix - Offence - Mercedes - Inaccurate Scrutineering Form.pdf
✓ Found Mercedes document: 2023 Spanish Grand Prix - Infringement - Car 63 - Abnormal change of direction.pdf
✓ Found Mercedes document: 2023 Spanish Grand Prix - Infringement - Mercedes - Parc Ferme.pdf
✓ Found Mercedes document: 2023 São Paulo Grand Prix - Infringement - Car 63 - Impeding at Pit Exit.pdf
✓ Found Mercedes document: 2023 United States Grand Prix - Infringement - Car 44 - Technical non-compliance (Plank).pdf
✓ Found Mercedes document: 2023 United States Grand Prix - Infringement - Car 63 - Impeding of Car 16.pdf
✓ Found Mercedes document: 2023 United States Grand Prix - Infringement - Car 63 - Leaving the track.pdf

Found 23 Mercedes documents in 2023-infridgement_profile





In [22]:
# Summary statistics
if all_mercedes_docs:
    df_mercedes = pd.DataFrame(all_mercedes_docs)
    
    print("\n" + "="*50)
    print("MERCEDES DOCUMENT EXTRACTION SUMMARY")
    print("="*50)
    
    print(f"Total Mercedes documents found: {len(all_mercedes_docs)}")
    print(f"Total text length: {sum([doc['text_length'] for doc in all_mercedes_docs]):,} characters")
    
    print("\nDocuments by year:")
    year_counts = df_mercedes['year'].value_counts().sort_index()
    for year, count in year_counts.items():
        print(f"  {year}: {count} documents")
    
    print("\nSample documents:")
    for _, row in df_mercedes.head().iterrows():
        print(f"  - {row['pdf_file']} ({row['text_length']} chars)")
    
    # Save summary
    df_mercedes.to_csv('mercedes_documents_summary.csv', index=False)
    print("\nSummary saved to: mercedes_documents_summary.csv")
    
else:
    print("\nNo Mercedes documents found. Please check the filtering criteria.")



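==================================================
MERCEDES DOCUMENT EXTRACTION SUMMARY
==================================================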
Total Mercedes documents found: 70
Total text length: 114,921 characters

Documents by year:
  2020-infirdgement_profile: 13 documents
  2021-infridgement_profile: 17 documents
  2022-infridgement_profile: 17 documents
  2023-infridgement_profile: 23 documents

Sample documents:
  - 2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.pdf (1438 chars)
  - 2020 Austrian Grand Prix - Decision - review of decision (document 33).pdf (1124 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - Failure to slow for yellow flags (post review).pdf (1632 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - incident with car 23.pdf (1418 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - Leaving the track in turn 10.pdf (1623 chars)

Summary saved to: mercedes_documents_summary.csv
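
The `year` column in the summary holds the full folder name rather than a number, which makes later grouping or sorting by season awkward. A small sketch of deriving a numeric season from the folder name, assuming every folder name begins with the four-digit year followed by a hyphen; `folder_to_year` is a hypothetical helper, and the counts are copied from the summary above:

```python
def folder_to_year(folder_name: str) -> int:
    """Return the leading four-digit year of a folder name such as '2020-..._profile'."""
    return int(folder_name.split("-", 1)[0])

# Per-folder document counts as reported in the summary output.
counts = {
    "2020-infirdgement_profile": 13,
    "2021-infridgement_profile": 17,
    "2022-infridgement_profile": 17,
    "2023-infridgement_profile": 23,
}

by_year = {folder_to_year(name): n for name, n in counts.items()}
print(by_year)                # {2020: 13, 2021: 17, 2022: 17, 2023: 23}
print(sum(by_year.values()))  # 70
```

With pandas, the same helper could be applied as `df_mercedes['year'].map(folder_to_year)`.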


In [23]:
# Display sample of extracted text for verification
if all_mercedes_docs:
    print("\n" + "="*50)
    print("SAMPLE EXTRACTED TEXT")
    print("="*50)
    
    # Get first Mercedes document
    first_doc = all_mercedes_docs[0]
    year_path = base_path / first_doc['year']
    txt_path = year_path / first_doc['txt_file']
    
    if txt_path.exists():
        with open(txt_path, 'r', encoding='utf-8') as f:
            sample_text = f.read()
        
        print(f"Document: {first_doc['pdf_file']}")
        print(f"Year: {first_doc['year']}")
        print(f"Length: {len(sample_text)} characters")
        print("\nFirst 1000 characters:")
        print("-" * 50)
        print(sample_text[:1000] + "..." if len(sample_text) > 1000 else sample_text)



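==================================================
SAMPLE EXTRACTED TEXT
==================================================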
Document: 2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.pdf
Year: 2020-infirdgement_profile
Length: 1438 characters

First 1000 characters:
--------------------------------------------------
From The Stewards To The Team Manager, Mercedes-AMG Petronas F1 TeamDocument 33 Date 04 July 2020 Time 19:44 2020 AUSTRIAN GRAND PRIX 2 - 5 July 2020 The StewardsThe Stewards, having received a report from the Race Director, summoned (document 29) and heard from the driver and team representative, have considered the following matter and determine the following: No / Driver 44 - Lewis Hamilton Competitor Mercedes-AMG Petronas F1 Team Time 15:59 Session Qualifying Fact Alleged failure to slow for single waved yellow flags between turn 5 and 7. Offence Alleged breach of Appendix H Article 2.5.5.1.b) of the FIA International Sporting Code. Decision No further action. Reason The Stewards heard from the driver of Car 44 (Lewis Hamilton) an

## Next Steps

After running this notebook:
1. Review the extracted Mercedes documents
2. Check the sample text for extraction problems, such as fused words (for example "F1 TeamDocument 33" in the sample above)
3. Proceed to consolidation and preprocessing steps
4. Begin text summarization process
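
For the preprocessing step, the flattened text can be split back into its labeled fields. The following is a rough sketch only: it assumes the labels appear in the same order as in the sample document above (`No / Driver`, `Competitor`, `Time`, `Session`, `Fact`, `Offence`, `Decision`, `Reason`) and that no label word occurs inside an earlier field's text. Other documents in the collection may not follow this layout, and `FIELD_PATTERN` is a hypothetical name.

```python
import re

# Labels in the order they appear in the sample document (an assumption for other files).
FIELD_PATTERN = re.compile(
    r"No / Driver (?P<driver>.*?) "
    r"Competitor (?P<competitor>.*?) "
    r"Time (?P<time>.*?) "
    r"Session (?P<session>.*?) "
    r"Fact (?P<fact>.*?) "
    r"Offence (?P<offence>.*?) "
    r"Decision (?P<decision>.*?) "
    r"Reason (?P<reason>.*)",
    re.DOTALL,
)

# Excerpt taken from the sample extracted text shown earlier.
sample = (
    "No / Driver 44 - Lewis Hamilton Competitor Mercedes-AMG Petronas F1 Team "
    "Time 15:59 Session Qualifying Fact Alleged failure to slow for single waved "
    "yellow flags between turn 5 and 7. Offence Alleged breach of Appendix H "
    "Article 2.5.5.1.b) of the FIA International Sporting Code. Decision No "
    "further action. Reason The Stewards heard from the driver of Car 44"
)

match = FIELD_PATTERN.search(sample)
fields = match.groupdict() if match else {}

for name, value in fields.items():
    print(f"{name}: {value}")
```

Documents that do not match return an empty dictionary, which makes it easy to list the files that need a different pattern.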
