## OCR-TRANSCRIPTION Hotfix Changes Notebook

- The initial issue was that the `Directory` class did not apply OCR or transcription to individual files.
- The cause can be traced to line 222 of `directory.py`:
- `data = [file.processor.__dict__ for file in batch]`

- When OCR or transcription is enabled, the decorator wraps the base processor: it calls the base processor first and then adds its own processing on top.
- As a result, `processor.__dict__` returns the decorator's attributes (only the wrapped `_processor` reference) instead of those of the base processor, which holds all the file attributes.
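
The behaviour described above can be reproduced with a minimal sketch. The classes `BaseProcessor` and `OCRDecorator` are hypothetical stand-ins, not the library's real classes, and the `__getattr__` delegation is an assumption about how a decorator of this kind might forward attribute access. Only the `_processor` attribute name is taken from the notebook output.

```python
class BaseProcessor:
    """Hypothetical stand-in for a concrete processor such as PngFileProcessor."""

    def __init__(self, file_path):
        self.file_path = file_path
        self.extension = ".png"
        self.metadata = {"width": 303, "height": 40}


class OCRDecorator:
    """Hypothetical stand-in for the OCR decorator: it stores only the wrapped processor."""

    def __init__(self, processor):
        self._processor = processor

    def __getattr__(self, name):
        # Assumed delegation: lookups that miss on the decorator fall through
        # to the wrapped processor.
        return getattr(self._processor, name)


decorated = OCRDecorator(BaseProcessor("logo.png"))

# Attribute access works through delegation...
print(decorated.extension)

# ...but __dict__ belongs to the decorator instance, so it holds only the wrapper reference.
decorator_keys = sorted(decorated.__dict__)
base_keys = sorted(decorated._processor.__dict__)
print(decorator_keys)  # ['_processor']
print(base_keys)       # ['extension', 'file_path', 'metadata']
```

Because `__dict__` is resolved on the decorator instance itself, delegation never comes into play, which is why a comprehension over `file.processor.__dict__` loses the file attributes.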

In [1]:
from file_processing import File
# With OCR and a custom Tesseract path (requires full version):
ocr_path = 'C:/Users/SLAM/AppData/Local/Programs/Tesseract-OCR/tesseract.exe'
file = File('tests/resources/directory_test_files/Health_Canada_logo.png', use_ocr=True, ocr_path=ocr_path)
# With OCR enabled, file.processor is the OCR decorator, so __dict__ holds only the wrapped processor
file.processor.__dict__

{'_processor': <file_processing.processors.png_processor.PngFileProcessor at 0x1fb818e8670>}

In [2]:
# Have to go down to base processor to extract attributes when there is a decorator
file.processor._processor.__dict__

{'file_path': WindowsPath('tests/resources/directory_test_files/Health_Canada_logo.png'),
 'open_file': True,
 'file_name': 'Health_Canada_logo.png',
 'owner': 'AD/SLAM',
 'extension': '.png',
 'size': 4125,
 'modification_time': 1720453907.6850233,
 'access_time': 1720709801.8869658,
 'creation_time': 1720453907.6850233,
 'parent_directory': WindowsPath('tests/resources/directory_test_files'),
 'permissions': '666',
 'is_file': True,
 'is_symlink': False,
 'absolute_path': WindowsPath('C:/Users/SLAM/Projects/file-processing-tools/tests/resources/directory_test_files/Health_Canada_logo.png'),
 'metadata': {'original_format': 'GIF',
  'mode': 'P',
  'width': 303,
  'height': 40,
  'ocr_text': 'Health Sante\nCanada Canada\n'}}

## Changes
- Added a `use_transcribers` parameter to the `Directory` class.
- Files are now separated and processed according to whether their processor is wrapped in a decorator, and their processor attributes are adjusted accordingly.
    - When a decorator is used, `file.processor.__dict__` is set to `file.processor._processor.__dict__` so that the full set of attributes is accessible.
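
One way to picture the unwrapping step is the sketch below. It is not the hotfix code itself: `base_attributes`, `FakeProcessor`, `FakeDecorator`, and `FakeFile` are hypothetical names introduced for illustration, and the sketch reads the base attributes without mutating the decorator, which is a swapped-in variant of the approach listed above. It assumes only that decorators keep the wrapped processor in `_processor`, as the notebook output shows.

```python
def base_attributes(processor):
    """Return a copy of the attribute dict of the innermost wrapped processor."""
    # Walk through any number of decorator layers; `_processor` is the
    # wrapper attribute name seen in the notebook output.
    while "_processor" in vars(processor):
        processor = vars(processor)["_processor"]
    return dict(vars(processor))


class FakeProcessor:
    """Hypothetical base processor holding the file attributes."""

    def __init__(self, file_name):
        self.file_name = file_name


class FakeDecorator:
    """Hypothetical decorator that only stores the wrapped processor."""

    def __init__(self, processor):
        self._processor = processor


class FakeFile:
    """Hypothetical File-like object exposing a `processor` attribute."""

    def __init__(self, processor):
        self.processor = processor


batch = [
    FakeFile(FakeProcessor("plain.csv")),                    # no decorator
    FakeFile(FakeDecorator(FakeProcessor("scanned.png"))),   # one decorator layer
]

# Same shape as the comprehension on line 222, but decorator-aware.
data = [base_attributes(file.processor) for file in batch]
print(data)
```

Files without a decorator pass through unchanged, so the same comprehension can serve both cases.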

`use_ocr` and `use_transcribers` can now be set to `True` or `False` when creating instances of the `Directory` class.

In [3]:
# Setting use_ocr and use_transcribers to True 
from file_processing import Directory
directory = Directory('tests/resources/directory_test_files', use_ocr=True, use_transcribers=True)
for file_obj in directory._file_generator():
    print(file_obj.processor.__dict__)

Processing files: 2 files completed [00:08,  4.03s/ files completed]

{'file_path': WindowsPath('tests/resources/directory_test_files/2021_Census_English.csv'), 'open_file': True, 'file_name': '2021_Census_English.csv', 'owner': 'AD/SLAM', 'extension': '.csv', 'size': 5384414, 'modification_time': 1720453907.6744914, 'access_time': 1720709799.5489495, 'creation_time': 1720453907.6744914, 'parent_directory': WindowsPath('tests/resources/directory_test_files'), 'permissions': '666', 'is_file': True, 'is_symlink': False, 'absolute_path': WindowsPath('C:/Users/SLAM/Projects/file-processing-tools/tests/resources/directory_test_files/2021_Census_English.csv'), 'metadata': {'text': 'CENSUS_YEAR","DGUID","ALT_GEO_CODE","GEO_LEVEL","GEO_NAME","TNR_SF","TNR_LF","DATA_QUALITY_FLAG","CHARACTERISTIC_ID","CHARACTERISTIC_NAME","CHARACTERISTIC_NOTE","C1_COUNT_TOTAL","SYMBOL","C2_COUNT_MEN+","SYMBOL","C3_COUNT_WOMEN+","SYMBOL","C10_RATE_TOTAL","SYMBOL","C11_RATE_MEN+","SYMBOL","C12_RATE_WOMEN+","SYMBOL\n2021","2021A000011124","01","Country","Canada","3.1","4.3","20000","

Processing files: 3 files completed [00:10,  3.24s/ files completed]ERROR:root:Error processing tests/resources/directory_test_files\ArtificialNeuralNetworksForBeginners_Locked.pdf: OCRProcessingError
Processing files: 4 files completed [00:10,  2.09s/ files completed]

{'file_path': WindowsPath('tests/resources/directory_test_files/ArtificialNeuralNetworksForBeginners.pdf'), 'open_file': True, 'file_name': 'ArtificialNeuralNetworksForBeginners.pdf', 'owner': 'AD/SLAM', 'extension': '.pdf', 'size': 221266, 'modification_time': 1720453907.682782, 'access_time': 1720709801.5658052, 'creation_time': 1720453907.682782, 'parent_directory': WindowsPath('tests/resources/directory_test_files'), 'permissions': '666', 'is_file': True, 'is_symlink': False, 'absolute_path': WindowsPath('C:/Users/SLAM/Projects/file-processing-tools/tests/resources/directory_test_files/ArtificialNeuralNetworksForBeginners.pdf'), 'metadata': {'text': 'Artificial  Neural Networks  for Beginner s\nCarlos G ershenson\nC.Gershenson@s ussex.ac.uk\n1. Introduc tion\nThe scope o f this teaching package is to  make a brief induction to Artifici al Neur al\nNetworks (ANNs)   for peo ple who have  no prev ious knowledge o f them. W e first make a  brief\nintroduction to mo dels o f networks, 

Processing files: 8 files completed [00:10,  1.55 files completed/s]

{'file_path': WindowsPath('tests/resources/directory_test_files/Health - Canada.ca.html'), 'open_file': True, 'file_name': 'Health - Canada.ca.html', 'owner': 'AD/SLAM', 'extension': '.html', 'size': 168865, 'modification_time': 1720453907.6850233, 'access_time': 1720709801.7589545, 'creation_time': 1720453907.6850233, 'parent_directory': WindowsPath('tests/resources/directory_test_files'), 'permissions': '666', 'is_file': True, 'is_symlink': False, 'absolute_path': WindowsPath('C:/Users/SLAM/Projects/file-processing-tools/tests/resources/directory_test_files/Health - Canada.ca.html'), 'metadata': {'text': '<!DOCTYPE html>\n<!-- saved from url=(0045)https://www.canada.ca/en/services/health.html -->\n<html class="js backgroundsize borderimage csstransitions fontface svg details progressbar meter mathml cors xlargeview wb-enable" dir="ltr" lang="en" xmlns="http://www.w3.org/1999/xhtml"><head prefix="og: http://ogp.me/ns#" class="at-element-marker"><meta http-equiv="Content-Type" content=

Processing files: 15 files completed [00:10,  1.37 files completed/s]

{'file_path': WindowsPath('tests/resources/directory_test_files/Microsoft Copilot Presentation - Copy.pptx'), 'open_file': True, 'file_name': 'Microsoft Copilot Presentation - Copy.pptx', 'owner': 'AD/SLAM', 'extension': '.pptx', 'size': 100174789, 'modification_time': 1720453908.0048223, 'access_time': 1720709801.9425883, 'creation_time': 1720453907.9840283, 'parent_directory': WindowsPath('tests/resources/directory_test_files'), 'permissions': '666', 'is_file': True, 'is_symlink': False, 'absolute_path': WindowsPath('C:/Users/SLAM/Projects/file-processing-tools/tests/resources/directory_test_files/Microsoft Copilot Presentation - Copy.pptx'), 'metadata': {'text': 'MICROSOFT COPILOT\nAn Overview\n\n2\nTopics\nIntroduction\nDescription\nCapabilities\nPrivacy and Security\nConclusion\nINTRODUCTION\nCopilot combines the power of large language models (LLMs) with Microsoft 365 Apps to improve productivity\n\nCopilot versions\nWindows 11 Copilot\nBuilt into the Windows\xa0operating system\




Report and analytics generation can now include OCR and transcription results.

In [4]:
from file_processing import Directory
directory = Directory('tests/resources/directory_test_files', use_ocr=True, use_transcribers=True)
directory.generate_report(report_file='directoryreporttest.csv', include_text=True)

Processing files: 3 files completed [00:09,  3.13s/ files completed]ERROR:root:Error processing tests/resources/directory_test_files\ArtificialNeuralNetworksForBeginners_Locked.pdf: OCRProcessingError
Processing files: 15 files completed [00:10,  1.45 files completed/s]
Processing batches: 1 batches completed [00:10, 10.36s/ batches completed]


In [5]:
directory = Directory('tests/resources/directory_test_files', use_ocr=True, use_transcribers=True)
directory.generate_analytics(report_file='analyticsreporttest.csv')

Processing files: 3 files completed [00:01,  1.58 files completed/s]ERROR:root:Error processing tests/resources/directory_test_files\ArtificialNeuralNetworksForBeginners_Locked.pdf: OCRProcessingError
Processing files: 15 files completed [00:02,  6.95 files completed/s]


{'size (MB)': {'.csv': 5.384414,
  '.docx': 0.019456,
  '.html': 0.168865,
  '.msg': 0.0768,
  '.pdf': 0.443368,
  '.png': 0.004125,
  '.pptx': 100.248831,
  '.rtf': 0.103257,
  '.txt': 0.039357,
  '.xlsx': 0.011885,
  '.xml': 0.004548,
  '.zip': 0.064254},
 'count': {'.csv': 1,
  '.docx': 1,
  '.html': 1,
  '.msg': 1,
  '.pdf': 2,
  '.png': 1,
  '.pptx': 3,
  '.rtf': 1,
  '.txt': 1,
  '.xlsx': 1,
  '.xml': 1,
  '.zip': 1}}