# Laboratory: Batch Processing of Signals in Files

**Last review:** November 30, 2025

## Instructions

### Dataset Selection
- Select your own dataset from the internet, or consult with your instructor regarding which dataset to use (e.g., TIMIT, ECG MIT-BIH, ogg_files_train_audio, etc.)
- Create Python scripts to perform the required tasks and develop a thorough understanding of the source code and results.
- You may (and should) use tools such as Copilot in VS Code and LLMs such as ChatGPT, but you must fully understand the complete code you write.

## Part I: Batch Processing of Multiple Files (`simple_batch_processing.py`)

### Objective
Implement recursive file discovery and batch processing with segmental analysis.

### Tasks

1. **Search for files recursively**
   - Recursively search for all files with a given extension within a folder
   - Store all file paths in a Pandas DataFrame
   - Save the DataFrame as a CSV text file

2. **Process each file in a loop**
   - Read the contents of each file (signal waveform, image, etc.)
   
   - **Process segments within each file** (e.g., windows of N=100 samples or blocks of 8×8 pixels):
     - Calculate the energy $e_i$ of the $i$-th segment and store it in an array
     - Print the energy of each segment/block to standard output
   
   - **Visualization**
     - Plot the signal information and its segmental energy using `subplot(2,1,1)` and `subplot(2,1,2)`
     - Display the signal waveform in one subplot and the energy curve (energy per segment) in the other
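
The discovery and segmental-energy steps above can be sketched as follows. This is a minimal sketch, not a reference solution: the function names `find_files` and `segment_energies`, the column name `path`, and the choice to drop a trailing partial segment are illustrative assumptions. The energy is assumed here to be the sum of squared samples in each segment, $e_i = \sum_n x_i[n]^2$; check with your instructor if a different definition is expected.

```python
from pathlib import Path

import numpy as np
import pandas as pd

SEGMENT_SIZE = 100  # samples per segment (N in the task description)


def find_files(root_folder, extension):
    """Return a DataFrame with one row per file under root_folder ending in extension."""
    paths = sorted(
        str(p) for p in Path(root_folder).rglob(f"*{extension}") if p.is_file()
    )
    # "path" is a hypothetical column name chosen for this sketch.
    return pd.DataFrame({"path": paths})


def save_file_list(files, csv_path):
    """Write the file list to a CSV text file without the index column."""
    files.to_csv(csv_path, index=False)


def segment_energies(signal, segment_size=SEGMENT_SIZE):
    """Energy of each non-overlapping segment, assumed to be the sum of squared samples."""
    signal = np.asarray(signal, dtype=np.float64)
    num_segments = len(signal) // segment_size  # a trailing partial segment is dropped
    trimmed = signal[: num_segments * segment_size]
    segments = trimmed.reshape(num_segments, segment_size)
    return np.sum(segments ** 2, axis=1)
```

For the plots, the waveform and the array returned by `segment_energies` can be drawn in the two subplots named in the task (`subplot(2,1,1)` and `subplot(2,1,2)` from `matplotlib.pyplot`).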

## Part II: Histogram Calculation of Amplitudes (`display_histograms.py`)

### Objective
Compute signal statistics and amplitude histograms for each file in the dataset.

### Tasks

1. **Learn file properties**
   - Determine the sampling interval $T_s$ (in seconds) for each waveform file in the dataset
   - Use $T_s$ to calculate the duration of each waveform or segment

2. **Calculate statistics**
   - Process each file as in Part I to compute the following statistics:
     - **Amplitude statistics:** minimum, maximum, and mean amplitude
     - **Duration statistics:** minimum, maximum, and mean duration (if working with waveforms)

3. **Generate histograms**
   - In a loop, calculate and display the histogram of amplitudes for each waveform file in the dataset
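
As an illustration of the tasks above, the sketch below reads one waveform, derives $T_s$ from the sampling rate ($T_s = 1/f_s$), and computes the per-file statistics and histogram. It assumes 16-bit PCM WAV files read with the standard-library `wave` module and keeps only the first channel; other datasets (ECG records, OGG audio) need a different reader. The names `read_wav`, `file_statistics`, `amplitude_histogram`, and the bin count are hypothetical choices for this sketch.

```python
import wave

import numpy as np

NUM_BINS = 50  # number of histogram bins (illustrative choice)


def read_wav(path):
    """Read a 16-bit PCM WAV file; return (first-channel samples, T_s in seconds)."""
    with wave.open(str(path), "rb") as wav_file:
        if wav_file.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        sample_rate = wav_file.getframerate()
        num_channels = wav_file.getnchannels()
        raw = wav_file.readframes(wav_file.getnframes())
    samples = np.frombuffer(raw, dtype="<i2").reshape(-1, num_channels)[:, 0]
    return samples, 1.0 / sample_rate


def file_statistics(samples, sampling_interval):
    """Amplitude statistics and duration (number of samples times T_s) of one file."""
    samples = np.asarray(samples, dtype=np.float64)
    return {
        "min": samples.min(),
        "max": samples.max(),
        "mean": samples.mean(),
        "duration_s": len(samples) * sampling_interval,
    }


def amplitude_histogram(samples, num_bins=NUM_BINS):
    """Return (counts, bin_edges) for the amplitudes of one file."""
    return np.histogram(samples, bins=num_bins)
```

Dataset-level duration statistics can then be obtained by collecting `duration_s` for every file in a list and taking its minimum, maximum, and mean.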

## Part III: Efficient Processing of Histograms for Large Datasets (`all_files_histogram.py`)

### Objective
Compute a unified histogram across all files in the dataset while minimizing memory usage.

### Tasks

1. **Aggregate histogram computation**
   - Calculate a single histogram for the entire dataset (combining data from all files)
   - Ensure memory-efficient processing to handle large numbers of files without excessive memory consumption
   - The resulting histogram should represent the amplitude distribution across all files in the dataset
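
One possible way to keep memory usage low is to fix the bin edges in advance and add up per-file counts, so that only one file is in memory at a time. The sketch below assumes the amplitude range is known beforehand (here the 16-bit PCM range, which is an assumption about the dataset); `accumulate_histogram` and the constants are hypothetical names.

```python
import numpy as np

AMPLITUDE_MIN = -32768  # assumed lower amplitude bound (16-bit PCM)
AMPLITUDE_MAX = 32767   # assumed upper amplitude bound (16-bit PCM)
NUM_BINS = 256          # number of histogram bins (illustrative choice)


def accumulate_histogram(signals, num_bins=NUM_BINS,
                         value_range=(AMPLITUDE_MIN, AMPLITUDE_MAX)):
    """Sum per-signal histograms over shared bin edges.

    `signals` may be any iterable, including a generator that loads
    one file at a time, so the whole dataset is never held in memory.
    """
    bin_edges = np.linspace(value_range[0], value_range[1], num_bins + 1)
    total_counts = np.zeros(num_bins, dtype=np.int64)
    for signal in signals:
        counts, _ = np.histogram(signal, bins=bin_edges)
        total_counts += counts
    return total_counts, bin_edges
```

Because every file uses the same `bin_edges`, the summed counts equal the histogram of the concatenated data, without ever building the concatenation. Samples outside `value_range` are not counted, so the range must cover the whole dataset.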

## Part IV: Saving Output Files While Preserving Folder Structure (`create_pngs.py`)

### Objective
Process histograms for all files and save results as PNG images while maintaining the input folder hierarchy in the output directory.

### Tasks

1. **Repeat histogram calculation**
   - Perform the same histogram calculations as Part II

2. **Save histograms as PNG files**
   - For each file with extension "myextension" in the input root folder, create a corresponding PNG output file with the same name and subfolder structure
   - **Example:** 
     - Input: `inputfolder\subfolder1\subfolder2\myname.wav`
     - Output: `outputfolder\subfolder1\subfolder2\myname.png`
   - `outputfolder` is provided by the user as a command-line argument

3. **Folder management**
   - Automatically create all necessary folders in the output directory
   - **Error handling:** Display an error message if `outputfolder` already exists (to prevent overwriting previously generated files)
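
The path mapping and folder management described above might look like the following sketch based on `pathlib`. The function names are hypothetical, and raising `FileExistsError` is just one way to signal the error; the calling script would catch it, print a message, and exit. The output folder itself would come from the command line (for example via `sys.argv` or `argparse`).

```python
from pathlib import Path


def output_path_for(input_file, input_root, output_root, new_suffix=".png"):
    """Map input_root/sub/name.ext to output_root/sub/name.png."""
    relative = Path(input_file).relative_to(input_root)
    return Path(output_root) / relative.with_suffix(new_suffix)


def prepare_output_root(output_root):
    """Create the output folder, refusing to reuse an existing one."""
    output_root = Path(output_root)
    if output_root.exists():
        raise FileExistsError(f"Output folder already exists: {output_root}")
    output_root.mkdir(parents=True)
    return output_root


def plan_outputs(input_root, extension, output_root):
    """Return (input_file, output_file) pairs and create the needed subfolders."""
    input_root = Path(input_root)
    output_root = prepare_output_root(output_root)
    pairs = []
    for input_file in sorted(input_root.rglob(f"*{extension}")):
        if not input_file.is_file():
            continue
        output_file = output_path_for(input_file, input_root, output_root)
        # Mirror the input subfolder structure under the output root.
        output_file.parent.mkdir(parents=True, exist_ok=True)
        pairs.append((input_file, output_file))
    return pairs
```

Each `output_file` in the returned pairs can then be passed to the plotting code that saves the histogram of the matching `input_file`.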

### Important Guidelines

#### Avoid Magic Numbers
- **Never use hardcoded values** scattered throughout your code
- Instead, define named constants at the top of your script with explanatory comments
- **Bad example:**
  ```python
  for i in range(100):  # What does 100 mean?
      ...  # loop body
  ```
- **Good example:**
  ```python
  SEGMENT_SIZE = 100  # Process audio in 100-sample segments
  for i in range(SEGMENT_SIZE):
      ...  # loop body
  ```

#### Use Functions Effectively
- Create reusable functions/methods whenever appropriate
- This improves code readability, maintainability, and testability
- Avoid repetitive code by extracting common operations into functions