# Periculum DS Internship - Technical Assessment - Mubarak Ibrahim

This notebook implements a data pipeline for processing home inventory PDF files.

My approach:
1. First of all, I Extract raw text from PDF
2. Align content line by line
3. Extract structured data
4. And finally, Output the results as JSON

## Firstly I installed the Pypdf2 package and  Import Required Libraries # 

In [19]:
pip install PyPDF2 

Note: you may need to restart the kernel to use updated packages.


In [20]:

import json
from datetime import datetime
import PyPDF2  # I have already install with this as above : pip install PyPDF2
import PyPDF2 as pdf # here i import the package 
from datetime import datetime
from PyPDF2 import PdfReader, PdfWriter

In [21]:
dir(pdf)

['DocumentInformation',
 'PageObject',
 'PageRange',
 'PaperSize',
 'PasswordType',
 'PdfFileMerger',
 'PdfFileReader',
 'PdfFileWriter',
 'PdfMerger',
 'PdfReader',
 'PdfWriter',
 'Transformation',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_cmap',
 '_codecs',
 '_encryption',
 '_merger',
 '_page',
 '_protocols',
 '_reader',
 '_security',
 '_utils',
 '_version',
 '_writer',
 'constants',
 'errors',
 'filters',
 'generic',
 'pagerange',
 'papersizes',
 'parse_filename_page_ranges',
 'types',
 'xmp']

In [22]:
pdf.__version__ #just checking for the latest version from the documentation

'3.0.1'

In [23]:
# to get the fil"
pdf_path = "home_inventory.pdf"

file = open(pdf_path, "rb")
reader = PdfReader(file)

In [24]:
info = reader.metadata # to read the document information
print(info)

{'/Author': 'pc', '/CreationDate': 'D:20250425205051Z', '/Creator': 'Microsoft® Excel® 2019', '/ModDate': "D:20250425215148+01'00'", '/Producer': 'Adobe PDF Services'}


In [25]:
info.author


'pc'

In [26]:
info['/CreationDate']

'D:20250425205051Z'

In [27]:
len(reader.pages) #to check the total of pages

10

## Define Classes

As required, we need two classes:
- OwnerInfo for storing owner details
- Inventory for storing inventory items

In [28]:
class OwnerInfo:
    def __init__(self, owner_name="", owner_address="", owner_telephone=""):
        self.owner_name = owner_name
        self.owner_address = owner_address
        self.owner_telephone = owner_telephone
    
    def to_dict(self):
        # here to convert to dictionary for JSON
        return {
            "owner_name": self.owner_name,
            "owner_address": self.owner_address,
            "owner_telephone": self.owner_telephone
        }

In [29]:
class Inventory:
    def __init__(self, purchase_date="", serial_number="", description="", source_style_area="", value=""):
        # Initialize inventory item attributes
        self.purchase_date = purchase_date
        self.serial_number = serial_number 
        self.description = description
        self.source_style_area = source_style_area
        self.value = value
    
    def to_dict(self): # Also to Convert to dictionary in order to serialize the data in JSON format
        return {
            "purchase_date": self.purchase_date,
            "serial_number": self.serial_number,
            "description": self.description,
            "source_style_area": self.source_style_area,
            "value": self.value
        }

## PDF Processing Functions

Now let's implement the required functions step by step.

In [30]:
def get_data_from_pdf(pdf_path):
    """Extracts raw text from a PDF file."""
    text = ""
    
    try:
        with open(pdf_path, 'rb') as file:
            # Create a PDF reader object
            reader = PyPDF2.PdfReader(file)
            
            # Iterate through each page and extract text
            for page in reader.pages:
                text += page.extract_text()
                
    except Exception as e:
        print(f"Error reading PDF: {e}")
        
    return text
get_data_from_pdf(pdf_path)

'S/N Area Item Description Source Purchase Date Style Serial No Value\n1Living Room Desk Target 07/06/2018 Premium 6DDZ7S36 846.59 $        \n2Kitchen LED TV Walmart 31/05/2015 Classic NJEZ3OPO 382.04 $        \n3Living Room LED TV Target 03/03/2019 Premium HRIS4LI8 1,603.37$     \n4Garage Dining Table Wayfair 26/03/2023 Modern 2HLNMD64 552.74 $        \n5Living Room Tool Set Target 06/01/2020 Classic R08QDU0S 1,546.39$     \n6Office LED TV Amazon 03/04/2015 Classic GSABG41R 1,319.36$     \n7Office Mattress Amazon 28/02/2022 Premium DVTMS64O 573.10 $        \n8Bedroom Tool Set Home Depot 07/04/2017 Classic RB93WPTH 1,421.38$     \n9Dining Room Mattress Target 20/03/2017 Compact 7RX8YNW9 1,760.11$     \n10Living Room Desk Target 30/06/2019 Classic US1B0BQI 941.43 $        \n11Garage Dining Table Wayfair 09/02/2016 Compact 1J7088BO 71.12 $           \n12Bedroom Desk Home Depot 21/08/2025 Modern GVJFQYT1 841.41 $        \n13Office Dining Table Best Buy 23/09/2022 Modern 5EXQQOB1 1,686.96$

In [31]:
raw_text = get_data_from_pdf(pdf_path)

In [32]:
def align_content(raw_text): # to split the raw text content line by line.
    lines = raw_text.split('\n')
    clean_lines = []  # to remove empty lines and trim whitespace
    for line in lines:
        line = line.strip()
        if line:  # Only add non-empty lines
            clean_lines.append(line)
    
    return clean_lines
align_content(raw_text)

['S/N Area Item Description Source Purchase Date Style Serial No Value',
 '1Living Room Desk Target 07/06/2018 Premium 6DDZ7S36 846.59 $',
 '2Kitchen LED TV Walmart 31/05/2015 Classic NJEZ3OPO 382.04 $',
 '3Living Room LED TV Target 03/03/2019 Premium HRIS4LI8 1,603.37$',
 '4Garage Dining Table Wayfair 26/03/2023 Modern 2HLNMD64 552.74 $',
 '5Living Room Tool Set Target 06/01/2020 Classic R08QDU0S 1,546.39$',
 '6Office LED TV Amazon 03/04/2015 Classic GSABG41R 1,319.36$',
 '7Office Mattress Amazon 28/02/2022 Premium DVTMS64O 573.10 $',
 '8Bedroom Tool Set Home Depot 07/04/2017 Classic RB93WPTH 1,421.38$',
 '9Dining Room Mattress Target 20/03/2017 Compact 7RX8YNW9 1,760.11$',
 '10Living Room Desk Target 30/06/2019 Classic US1B0BQI 941.43 $',
 '11Garage Dining Table Wayfair 09/02/2016 Compact 1J7088BO 71.12 $',
 '12Bedroom Desk Home Depot 21/08/2025 Modern GVJFQYT1 841.41 $',
 '13Office Dining Table Best Buy 23/09/2022 Modern 5EXQQOB1 1,686.96$',
 '14Kitchen Dining Table Amazon 07/04/201

In [33]:
def parse_date(date_str):
    """Convert date from metedata i extracted earlier in my code 'D:20250425205051Z' format to ISO format."""
    try:
        # Handle PDF metadata date format (D:YYYYMMDD...)
        if date_str.startswith('D:'):
            date_str = date_str[2:]
            
            year = date_str[0:4]
            month = date_str[4:6]
            day = date_str[6:8]
            
            hour = date_str[8:10] if len(date_str) > 8 else "00"
            minute = date_str[10:12] if len(date_str) > 10 else "00"
            second = date_str[12:14] if len(date_str) > 12 else "00"
            
            date_obj = datetime(int(year), int(month), int(day), 
                               int(hour), int(minute), int(second))
        
        # Handle DD/MM/YYYY format
        elif '/' in date_str:
            day, month, year = date_str.split('/')
            date_obj = datetime(int(year), int(month), int(day))
        
        else:
            raise ValueError(f"Unsupported date format: {date_str}")
            
        return date_obj.strftime("%Y-%m-%dT%H:%M:%S")
    except (ValueError, IndexError) as e:
        # In case the date is invalid
        print(f"Warning: Invalid date format - {date_str}. Error: {e}")
        return ""
date_str = 'D:20250425205051Z'
print(parse_date(date_str))

2025-04-25T20:50:51


In [34]:
parse_date(date_str)

'2025-04-25T20:50:51'

## Getting to create the Full Pipeline Function

thios function runs the full pipeline and also saves the result to a JSON file.

In [37]:

raw_text = get_data_from_pdf(pdf_path)


aligned_content = align_content(raw_text)


owner_info = extract_owner_info(aligned_content)

In [38]:
def extract_owner_info(aligned_content):
    """Extract owner information from the content."""
    owner_info = OwnerInfo()
    
    for i, line in enumerate(aligned_content):
        if "Owner Information" in line:
            # Found the owner info section
            if i + 1 < len(aligned_content):
                owner_info.owner_name = aligned_content[i + 1]
            if i + 2 < len(aligned_content):
                owner_info.owner_address = aligned_content[i + 2]
            # Look for phone number
            for j in range(i+3, min(i+6, len(aligned_content))):
                if "(" in aligned_content[j] and ")" in aligned_content[j]:
                    owner_info.owner_telephone = aligned_content[j]
                    break
            break
    
    return owner_info
extract_owner_info(aligned_content)

<__main__.OwnerInfo at 0x2dbef9f02b0>

In [39]:
def find_inventory_table_start(aligned_content):
    """Find the starting index of the inventory table."""
    for i, line in enumerate(aligned_content):
        if "S/N Area" in line and "Item Description" in line:
            return i + 1  # Start from the next line
    
    print("Warning: Could not find inventory table header!")
    return -1


In [40]:
import re
from datetime import datetime
#  Define function to parse inventory rows

def parse_inventory_row(line):
    """Parse a single inventory row and return an Inventory object."""
    # Check if line starts with a number (inventory item number)
    match = re.match(r'^(\d+)\s', line)
    if not match:
        return None  # Not an inventory row
    
    parts = line.split()
    
    # Need at least a minimum number of parts
    if len(parts) < 7:  # Number, Area, Description, Source, Date, Style, Serial, Value
        return None
    
    # First extract index & area
    idx = 1  # Start after the item number
    
    # Area might be one or two words like "Living Room"
    if idx + 1 < len(parts) and parts[idx+1].lower() == "room":
        area = f"{parts[idx]} {parts[idx+1]}"
        idx += 2
    else:
        area = parts[idx]
        idx += 1
    
    # Look for the date which has a specific format DD/MM/YYYY
    date_idx = -1
    for j, part in enumerate(parts):
        if re.match(r'\d{2}/\d{2}/\d{4}', part):
            date_idx = j
            break
    
    if date_idx == -1:
        # No date found, skip this row
        return None
    # Now we can work backwards and forwards from the date
    purchase_date = parts[date_idx]
    source = parts[date_idx - 1]
    style = parts[date_idx + 1]
    serial_number = parts[date_idx + 2]
    value = parts[-1].replace('$', '').replace(',', '')
    
    # Description is between area and source
    description_parts = parts[idx:date_idx-1]
    description = ' '.join(description_parts)
    
    # Create the source_style_area field
    source_style_area = f"{source} {style} {area}"
    
    # Format the date properly
    iso_date = parse_date(purchase_date)
    
    return Inventory(
        purchase_date=iso_date,
        serial_number=serial_number,
        description=description,
        source_style_area=source_style_area,
        value=value
    )

In [41]:
# Define function to extract all inventory items
def extract_inventory_items(aligned_content):
    """Extract all inventory items from the content."""
    inventory_items = []
    table_start = find_inventory_table_start(aligned_content)
    
    if table_start == -1:
        return []
    
    # Process each inventory item row
    for i in range(table_start, len(aligned_content)):
        try:
            item = parse_inventory_row(aligned_content[i])
            if item:
                inventory_items.append(item.to_dict())
        except Exception as e:
            print(f"Error parsing line {i}: {e}")
            continue
    
    return inventory_items


In [42]:
# Define the main extract_data function
def extract_data(aligned_content):
    """Extract structured data from aligned content."""
    owner_info = extract_owner_info(aligned_content)
    inventory_items = extract_inventory_items(aligned_content)
    
    # Create the final result dictionary
    result = {
        **owner_info.to_dict(),
        "data": inventory_items
    }
    
    return result

In [43]:
#  Process the PDF and extract data
# I define the PDF path
pdf_path = "home_inventory.pdf" 

# Get raw text from PDF
raw_text = get_data_from_pdf(pdf_path)

# Clean and align the content
aligned_content = align_content(raw_text)

# Extract structured data
result = extract_data(aligned_content)

# Print the result in a readable format
print(json.dumps(result, indent=2))

# I also save to a JSON file
with open("extracted_inventory.json", "w") as f:
    json.dump(result, f, indent=2)

{
  "owner_name": "John Doe",
  "owner_address": "123 Maple StreetName",
  "owner_telephone": "",
  "data": [
    {
      "purchase_date": "2021-04-22T00:00:00",
      "serial_number": "U7PXVQAB",
      "description": "Stand Mixer Best",
      "source_style_area": "Buy Modern Dining Room",
      "value": ""
    },
    {
      "purchase_date": "2022-07-04T00:00:00",
      "serial_number": "4OGYNJGI",
      "description": "Tool Set Home",
      "source_style_area": "Depot Modern Dining Room",
      "value": ""
    },
    {
      "purchase_date": "2016-05-11T00:00:00",
      "serial_number": "Z9TTB4MF",
      "description": "Desk Home",
      "source_style_area": "Depot Premium Garage",
      "value": ""
    },
    {
      "purchase_date": "2017-10-21T00:00:00",
      "serial_number": "7XW6R69R",
      "description": "Dining Table Home",
      "source_style_area": "Depot Classic Living Room",
      "value": ""
    },
    {
      "purchase_date": "2017-12-08T00:00:00",
      "serial_number

## Challenges and Notes

While implementing this, I faced a few challenges:

1. PDF parsing can be messy - the text alignment might not be exactly as it appears visually in the PDF
2. The inventory items have variable formats, making it tricky to parse consistently
3. Date formatting needed special handling to convert to ISO format

Some improvements I could make with more time:
- Better error handling for edge cases
- More robust parsing of complex inventory descriptions
- Add unit tests to verify each function works correctly