# SOC 2 Type 2 Document Analysis Learning Lab
## Parsing Unstructured Vendor Documentation

**Generated for:** TPRM Lead, B2B SaaS Tech Company

**Your Goal:** Transform 3+ hours of manual SOC 2 reading per vendor into 30-45 minutes of structured analysis

**Why This Matters for Your Role:** With 500+ vendors and manual reviews creating bottlenecks, automating document parsing frees you to focus on risk assessment and stakeholder communication, your strongest areas.

**Technical Leap:** Moving from "reading PDFs" → "extracting structured data programmatically"

---

## 📅 12-Week Progressive Build
*3 hours/week = 36 total hours*

### Phase 1: Foundation (Weeks 1-2)
- Week 1: Document Structure Mapping
- Week 2: Manual Baseline Measurement

### Phase 2: Core Skills (Weeks 3-5)
- Week 3: Your First Python PDF Parser
- Week 4: Finding Specific Sections
- Week 5: Extracting Structured Data

### Phase 3: Integration (Weeks 6-8)
- Week 6: Batch Processing Multiple PDFs
- Week 7: Google Sheets Integration
- Week 8: Building Dashboard

### Phase 4: Advanced (Weeks 9-10)
- Week 9: Exception Pattern Analysis
- Week 10: Automated Risk Scoring

### Phase 5: Production (Weeks 11-12)
- Week 11: Production Runbook & Documentation
- Week 12: Stakeholder Presentation & Metrics

---

# Week 1: Document Structure Mapping
**Time:** 3 hours  
**Difficulty:** Beginner (No coding)

## 🎯 Learning Goals
- [ ] Understand SOC 2 Type 2 report structure
- [ ] Identify the data points you always need to extract
- [ ] Map your current manual process
- [ ] Identify time-consuming pain points

## Activity 1: Visual Mapping Exercise (90 mins)

Open a recent SOC 2 report and create a Google Doc with this template:

```
DOCUMENT: [Vendor Name] SOC 2 Type 2

SECTIONS I ALWAYS CHECK:
□ Opinion letter (page __)
□ Control descriptions (pages __)
□ Testing results (pages __)
□ Exceptions/deviations (pages __)
□ Subservice organizations (pages __)

KEY DATA POINTS I EXTRACT:
- Report period: [dates]
- Opinion type: [Unqualified/Qualified/Adverse/Disclaimer]
- Trust Services Categories: [List]
- Number of controls tested: [#]
- Exceptions found: [#]
- Complementary user entity controls: [Y/N, how many]

RED FLAGS I LOOK FOR:
- [List your current mental checklist]

TIME SPENT:
- Finding these sections: __ mins
- Reading opinion: __ mins
- Reviewing exceptions: __ mins
- Assessing CUECs: __ mins
```

**Repeat this for 2 more vendors to identify patterns.**

## Activity 2: Create Your "Extraction Checklist" (30 mins)

Build a Google Sheet with columns for the data points you always need:

| Vendor | Report Period | Opinion | TSC Covered | Controls Tested | Exceptions | CUECs | Review Time |
|--------|---------------|---------|-------------|-----------------|------------|-------|-------------|
| Vendor A | ... | ... | ... | ... | ... | ... | 3.5 hours |

**This becomes your target output structure for automation.**

## ✅ Week 1 Deliverable

- [ ] Structured template showing what you need from every SOC 2 report
- [ ] Google Sheet with extraction checklist columns
- [ ] Documented pain points (where does time go?)

**💡 Key Insight:** You should now have a clear picture of WHAT to automate. Next week, we'll measure HOW LONG it takes manually.

---

# Week 2: Manual Baseline Measurement
**Time:** 3 hours  
**Difficulty:** Beginner (No coding)

## 🎯 Learning Goals
- [ ] Establish baseline metrics (current state)
- [ ] Identify automation opportunities
- [ ] Document error-prone tasks
- [ ] Prioritize what to automate first

## Activity: Timed Review Session (2.5 hours)

Review 3 new SOC 2 reports using your extraction checklist from Week 1.

**Set a timer for each report and track:**
- Total time per report
- Time finding each section
- Time reading vs. copy-pasting
- Any data points you missed initially

**Document in a Google Doc:**

```
MANUAL REVIEW PAIN POINTS

Most time-consuming tasks:
1. Finding exception details scattered across pages 45-89: 45 mins
2. Determining which Trust Services Categories are covered: 15 mins
3. Copy-pasting into tracking spreadsheet: 20 mins

Most error-prone tasks:
1. Missing complementary controls in dense paragraphs
2. Miscounting number of exceptions

Most tedious tasks:
1. Finding the same sections in every report
2. Formatting data consistently in spreadsheet
```

## ✅ Week 2 Deliverable

- [ ] Baseline metrics: Average time per report = ____ hours
- [ ] Prioritized list of automation targets
- [ ] Pain point documentation

**💡 Key Insight:** You now have concrete numbers to show time savings after automation. Keep these metrics: you'll present them to your manager in Week 12!

---

# Week 3: Your First Python PDF Parser
**Time:** 3 hours  
**Difficulty:** Intermediate

## 🎯 Learning Goals
- [ ] Set up Python environment
- [ ] Install PDF processing library
- [ ] Write your first PDF extraction script
- [ ] Understand PDF text extraction quality

## Setup (30 mins)

Install the required library:

In [None]:
# Run this once to install PyPDF2.
# %pip installs into the environment the notebook kernel is running in.
%pip install PyPDF2

In [None]:
# Import required libraries
import PyPDF2

print("✅ Libraries imported successfully!")

## Your First PDF Extraction Script (2 hours)

Let's build a function that extracts all text from a PDF:

In [None]:
def extract_text_from_pdf(pdf_path):
    """
    Opens a PDF and extracts all text.
    
    Args:
        pdf_path: Path to your PDF file
    
    Returns:
        String containing all the text
    """
    # Open the PDF file in "read binary" mode
    with open(pdf_path, 'rb') as file:
        # Create a PDF reader object (this does the heavy lifting)
        pdf_reader = PyPDF2.PdfReader(file)
        
        # Get the total number of pages
        num_pages = len(pdf_reader.pages)
        print(f"📄 Found {num_pages} pages in the document")
        
        # Create an empty string to store all the text
        full_text = ""
        
        # Loop through each page
        for page_num in range(num_pages):
            # Get this specific page
            page = pdf_reader.pages[page_num]
            
            # Extract text from this page; image-only (scanned) pages
            # yield little or no text, so fall back to an empty string
            page_text = page.extract_text() or ""
            
            # Add it to our full text with a page marker
            full_text += f"\n--- PAGE {page_num + 1} ---\n"
            full_text += page_text
        
        return full_text

print("✅ Function defined successfully!")

## Test with a Real SOC 2 Report

**Replace the path below with YOUR actual SOC 2 PDF:**

In [None]:
# UPDATE THIS PATH to your SOC 2 PDF location
pdf_path = "/path/to/your/soc2_report.pdf"

print(f"🔍 Extracting text from: {pdf_path}")

# Extract the text
extracted_text = extract_text_from_pdf(pdf_path)

# Show statistics
print(f"✅ Extraction complete!")
print(f"📊 Total characters: {len(extracted_text):,}")
print(f"📊 Total lines: {len(extracted_text.splitlines()):,}")

## Preview the Extracted Text

Let's look at the first 1000 characters:

In [None]:
print("📄 PREVIEW OF EXTRACTED TEXT:")
print("="*50)
print(extracted_text[:1000])
print("="*50)
print("\n[...] (showing first 1000 characters)")

## Save to Text File

So you can review extraction quality in a text editor:

In [None]:
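# Create output filename next to the PDF.
# pathlib avoids overwriting the source file when the path does not
# contain a lowercase ".pdf" (str.replace would then change nothing).
from pathlib import Path

source = Path(pdf_path)
output_file = source.with_name(source.stem + "_extracted.txt")

# Write to file
with open(output_file, 'w', encoding='utf-8') as f:
    f.write(extracted_text)

print(f"✅ Text saved to: {output_file}")
print("💡 Open this file in TextEdit or VS Code to review")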

## Observation Exercise (30 mins)

Compare the extracted text to the original PDF:

**What extracted well?**
- Paragraphs in the opinion letter
- Control descriptions
- [Add your observations]

**What was messy?**
- Tables lost their structure
- Multi-column layouts ran together
- [Add your observations]

**What was missing?**
- Page headers/footers
- [Add your observations]
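
One hedged way to put numbers on the "what was missing?" question is to count characters per page and flag pages that came back nearly empty, which often points to a scanned or image-only page. The sketch below assumes the `--- PAGE n ---` markers written by `extract_text_from_pdf`. The names `split_pages` and `flag_sparse_pages` and the 50-character threshold are illustrative choices, not part of PyPDF2.

```python
import re

# Matches the page markers that extract_text_from_pdf() inserts
PAGE_MARKER = re.compile(r"\n--- PAGE (\d+) ---\n")

def split_pages(full_text):
    """Split marker-delimited text into {page_number: page_text}."""
    # re.split with one capture group returns
    # [text_before_first_marker, "1", page_1_text, "2", page_2_text, ...]
    parts = PAGE_MARKER.split(full_text)
    pages = {}
    for i in range(1, len(parts) - 1, 2):
        pages[int(parts[i])] = parts[i + 1]
    return pages

def flag_sparse_pages(pages, min_chars=50):
    """Return page numbers whose stripped text is shorter than min_chars."""
    return [num for num, text in sorted(pages.items())
            if len(text.strip()) < min_chars]

# Tiny made-up example standing in for real extracted text
sample = (
    "\n--- PAGE 1 ---\n" + "Independent service auditor's report. " * 5 +
    "\n--- PAGE 2 ---\n" + "   " +
    "\n--- PAGE 3 ---\n" + "Section IV: tests of controls and results. " * 5
)
pages = split_pages(sample)
print(flag_sparse_pages(pages))
```

In the notebook you would call `split_pages(extracted_text)` instead of using the sample string, then open the flagged pages in the original PDF to see what was lost.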

## ✅ Week 3 Deliverable

- [ ] Working script that extracts text from any SOC 2 PDF
- [ ] Extracted text file saved for review
- [ ] Documented observations about extraction quality

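**💡 Key Insight:** Extraction is rarely perfect: tables and multi-column layouts lose their structure, and scanned pages yield no text without OCR. Paragraph text is usually clean enough to search, though. Next week, we'll make this smarter by finding specific sections.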
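
If the messy spots get in the way of searching, a light clean-up pass can help. The following is a minimal sketch, assuming the main problems are words hyphenated across line breaks and irregular spacing. `normalize_extracted_text` is a hypothetical helper name, and the rules are deliberately conservative. Note that rejoining hyphenated line breaks will also merge genuinely hyphenated terms that happen to wrap (for example `carve-` / `out` becomes `carveout`).

```python
import re

def normalize_extracted_text(text):
    """Tidy common PDF-extraction artifacts before keyword searching."""
    # Rejoin words hyphenated across a line break: "organi-\nzation" -> "organization"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more newlines into one blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

# Made-up snippet imitating messy extraction output
messy = "The subservice organi-\nzation   is  carved\tout.\n\n\n\nSee Section III."
print(normalize_extracted_text(messy))
```

Because this changes character positions, it is safest to run it once on the extracted text before any searching, rather than after matches have been located.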

---

# Week 4: Finding Specific Sections
**Time:** 3 hours  
**Difficulty:** Intermediate

## 🎯 Learning Goals
- [ ] Use regular expressions to find keywords
- [ ] Locate specific sections (Opinion, Exceptions, CUECs)
- [ ] Extract context around matches
- [ ] Save findings to structured format

## Import Additional Libraries

In [None]:
import re  # Regular expressions for pattern matching

print("✅ Regular expressions module imported")

## Define Section Keywords

Based on your Week 1 mapping, define the keywords for sections you care about:

In [None]:
# Define the sections you care about
SECTION_KEYWORDS = {
    'Opinion Letter': [
        'independent service auditor',
        'in our opinion',
        'unqualified opinion',
        r'\bqualified opinion'  # \b stops this from matching inside "unqualified opinion"
    ],
    'Report Period': [
        'report period',
        'examination period',
        r'\bfrom\b.{1,40}?\bto\b.{1,40}?202[0-9]'  # Bounded regex to catch date ranges on one line
    ],
    'Exceptions': [
        'exception',
        'deviation',
        'deficiency',
        'did not operate effectively'
    ],
    'Complementary Controls': [
        'complementary user entity controls',
        'CUEC',
        'user entity controls',
        'complementary controls'
    ],
    'Subservice Organizations': [
        'subservice organi[zs]ation',
        'carve.?out',
        'inclusive method',
        'carved.?out'
    ]
}

print("✅ Section keywords defined")
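
Before running regex keywords against a full report, it can help to check them against a few short phrases. The sketch below does that for some of the patterns in this lab. The sample phrases are invented for illustration, so swap in wording from your own reports. The last two rows show that a bare `qualified opinion` pattern also matches inside "unqualified opinion", while a `\b` word boundary prevents it.

```python
import re

# (pattern, sample phrase, should it match?)
checks = [
    (r"subservice organi[zs]ation", "The subservice organisation hosts the data", True),
    (r"subservice organi[zs]ation", "Subservice Organization controls", True),
    (r"carve.?out", "presented using the carve-out method", True),
    (r"carve.?out", "the carveout method", True),
    (r"carve.?out", "carve  out", False),  # two spaces: .? allows at most one character
    (r"qualified opinion", "an unqualified opinion", True),  # substring match
    (r"\bqualified opinion", "an unqualified opinion", False),  # \b needs a word boundary
]

for pattern, phrase, expected in checks:
    found = re.search(pattern, phrase, re.IGNORECASE) is not None
    status = "OK" if found == expected else "UNEXPECTED"
    print(f"{status:10} {pattern!r} on {phrase!r} -> {found}")
```

Any row printed as `UNEXPECTED` means the pattern behaves differently from what you assumed and is worth adjusting before the full run.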

## Build Section Finder Function

In [None]:
def find_sections(text, section_keywords):
    """
    Finds sections in the text based on keywords.
    
    Args:
        text: The full extracted text
        section_keywords: Dictionary of section names and their keywords
    
    Returns:
        Dictionary with section names and their locations
    """
    results = {}
    
    for section_name, keywords in section_keywords.items():
        print(f"\n🔍 Looking for: {section_name}")
        
        found_locations = []
        for keyword in keywords:
            # Case-insensitive search
            pattern = re.compile(keyword, re.IGNORECASE)
            matches = pattern.finditer(text)
            
            for match in matches:
                # Get context (100 characters before and after)
                start = max(0, match.start() - 100)
                end = min(len(text), match.end() + 100)
                context = text[start:end]
                
                found_locations.append({
                    'keyword': keyword,
                    'position': match.start(),
                    'context': context
                })
        
        results[section_name] = found_locations
        print(f"   Found {len(found_locations)} mentions")
    
    return results

print("✅ Section finder function defined")
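
`find_sections` reports each match as a character offset, which is hard to look up in the original PDF. One possible way to turn an offset back into a page number is sketched below, assuming the text still contains the `--- PAGE n ---` markers added in Week 3. `build_page_index` and `page_for_position` are hypothetical helper names.

```python
import bisect
import re

def build_page_index(text):
    """Return parallel lists: marker start offsets and their page numbers."""
    offsets, numbers = [], []
    for m in re.finditer(r"--- PAGE (\d+) ---", text):
        offsets.append(m.start())
        numbers.append(int(m.group(1)))
    return offsets, numbers

def page_for_position(position, offsets, numbers):
    """Map a character offset to the page whose marker precedes it."""
    # bisect_right counts how many markers start at or before `position`
    i = bisect.bisect_right(offsets, position)
    return numbers[i - 1] if i > 0 else None

# Made-up two-page text in the Week 3 marker format
sample = "\n--- PAGE 1 ---\nIn our opinion...\n--- PAGE 2 ---\nNo exceptions noted."
offsets, numbers = build_page_index(sample)
pos = sample.index("exceptions")
print(page_for_position(pos, offsets, numbers))
```

In the notebook, you could build the index once from `extracted_text` and then call `page_for_position(loc['position'], offsets, numbers)` for each match.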

## Run Section Finder on Your SOC 2

In [None]:
# Use the extracted text from Week 3
print("📄 Analyzing sections...\n")
print("="*50)

# Find all sections
results = find_sections(extracted_text, SECTION_KEYWORDS)
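
Some keywords overlap (for example, `user entity controls` sits inside `complementary user entity controls`), so the "Found N mentions" count can include the same passage more than once. A rough way to compensate is sketched below: sort the matches by position and drop any that start close to the previously kept one. `dedupe_locations` and the 50-character window are illustrative assumptions, not a tuned rule.

```python
def dedupe_locations(locations, window=50):
    """Drop matches that start within `window` characters of the last kept match."""
    kept = []
    for loc in sorted(locations, key=lambda d: d["position"]):
        if not kept or loc["position"] - kept[-1]["position"] > window:
            kept.append(loc)
    return kept

# Made-up matches in the same shape find_sections() returns
sample_locations = [
    {"keyword": "user entity controls", "position": 1014, "context": "..."},
    {"keyword": "complementary user entity controls", "position": 1000, "context": "..."},
    {"keyword": "CUEC", "position": 5200, "context": "..."},
]
unique = dedupe_locations(sample_locations)
print(len(sample_locations), "raw ->", len(unique), "after de-duplication")
```

Applied to real output, this would look like `dedupe_locations(results['Complementary Controls'])`. A wider window merges more aggressively, so check a few contexts by eye before trusting the reduced counts.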

## Save Detailed Results

In [None]:
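# Save to file for review
from pathlib import Path

source = Path(pdf_path)
output_file = source.with_name(source.stem + "_section_analysis.txt")

# Explicit utf-8 so the ❌ marker below is written correctly
# regardless of the platform's default encoding
with open(output_file, 'w', encoding='utf-8') as f: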
    f.write("SOC 2 SECTION ANALYSIS\n")
    f.write("="*50 + "\n\n")
    
    for section_name, locations in results.items():
        f.write(f"\n{section_name.upper()}\n")
        f.write("-"*50 + "\n")
        
        if locations:
            for loc in locations:
                f.write(f"\nKeyword: {loc['keyword']}\n")
                f.write(f"Context: ...{loc['context']}...\n")
                f.write("\n")
        else:
            f.write("❌ Not found\n")

print(f"\n✅ Detailed analysis saved to: {output_file}")
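
The text report is easy to read, but a spreadsheet-friendly format is closer to the Week 1 extraction checklist. As a sketch of one option, the code below writes one CSV row per match using Python's standard `csv` module. `write_results_csv` and the column names are assumptions chosen for illustration.

```python
import csv
import io

def write_results_csv(results, fileobj):
    """Write one CSV row per match: section, keyword, position, context."""
    writer = csv.writer(fileobj)
    writer.writerow(["section", "keyword", "position", "context"])
    for section_name, locations in results.items():
        for loc in locations:
            # Flatten whitespace so each context stays readable in one cell
            context = " ".join(loc["context"].split())
            writer.writerow([section_name, loc["keyword"], loc["position"], context])

# Made-up results in the same shape find_sections() returns
sample_results = {
    "Exceptions": [
        {"keyword": "exception", "position": 42,
         "context": "One exception\nwas noted, see page 51"},
    ],
    "Subservice Organizations": [],
}
buffer = io.StringIO()
write_results_csv(sample_results, buffer)
print(buffer.getvalue())
```

To write a real file instead of an in-memory buffer, pass a file opened with `open(path, 'w', newline='', encoding='utf-8')`, which is the mode the `csv` module documentation recommends.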

## ✅ Week 4 Deliverable

- [ ] Script that automatically finds your priority sections
- [ ] Section analysis file showing all matches with context
- [ ] Time comparison: Manual section finding vs. script

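**💡 Key Insight:** The script surfaces candidate passages in seconds, a search that takes 15-20 minutes by hand. Keyword hits are mentions rather than confirmed section boundaries, so skim the context before relying on them. Next week, we'll extract structured data from these sections.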
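
For the time comparison in this week's deliverable, the script side can be measured rather than estimated. A minimal sketch using `time.perf_counter` follows. `time_call` is a hypothetical helper, and the workload here is a stand-in string, not a real report.

```python
import re
import time

def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; in the notebook you would time
# find_sections(extracted_text, SECTION_KEYWORDS) instead
sample_text = "No exceptions noted. " * 10_000
matches, seconds = time_call(re.findall, r"exception", sample_text, re.IGNORECASE)
print(f"{len(matches):,} matches in {seconds:.4f} s")
```

Record the measured seconds next to your Week 2 manual timings so both numbers come from the same reports.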

---

# Weeks 5-12: Continue Building...

**Note:** This notebook shows the structure for Weeks 1-4. The complete lab continues through Week 12 with:

- **Week 5:** Extract structured data to CSV
- **Week 6:** Batch process multiple PDFs
- **Week 7:** Upload to Google Sheets automatically
- **Week 8:** Build interactive dashboard
- **Week 9:** Exception pattern analysis
- **Week 10:** Automated risk scoring
- **Week 11:** Production runbook
- **Week 12:** Stakeholder presentation

Each week builds on previous code with progressive complexity, hands-on exercises, and real deliverables.

---

## 🎓 Next Steps

1. **Complete Weeks 1-4** following this notebook
2. **Test with 3 different SOC 2 reports** from different auditors
3. **Document time savings** vs. manual process
4. **Continue to Week 5** for structured data extraction

---

## 📚 Resources

- [PyPDF2 Documentation](https://pypdf2.readthedocs.io/)
- [Python Regular Expressions Guide](https://docs.python.org/3/library/re.html)
- [SOC 2 Report Structure Overview](https://www.aicpa.org/soc4so)

---

## 💬 Questions?

Check the [main repository](../../README.md) or see other [example labs](../)