# Saving sources as txt files

Script Purpose: Batch-processes CSV files containing OCR-corrected text and converts each one to an individual TXT file.

Key Operations:

- Reads all CSV files from the literature csv_clean folder

- Extracts the ocr_corrected_text column from each CSV

- Combines all text rows into a single string per file

- Cleans the text by removing 'nan' values and extra whitespace

- Saves each processed text as a separate TXT file in the literature txt folder

- Skips non-CSV files and hidden files or folders (names starting with a dot) to avoid permission errors

- Provides progress feedback during processing
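
The extract, combine, and clean steps can be sketched in isolation as follows. This is a minimal illustration, not the notebook's own code: the helper name `rows_to_text` and the sample DataFrame are hypothetical, and collapsing internal whitespace with `split()` is one possible reading of "extra whitespace". It drops missing rows before joining, so a missing value never turns into the literal string 'nan' and real words that contain "nan" (such as "financial") stay intact.

```python
import pandas as pd


def rows_to_text(df: pd.DataFrame, column: str = "ocr_corrected_text") -> str:
    """Join the non-missing rows of `column` into one cleaned string.

    Hypothetical helper for illustration only.
    """
    # Drop missing rows first so they never become the literal string 'nan'
    rows = df[column].dropna().astype(str)
    # split() with no arguments splits on any run of whitespace, so re-joining
    # the pieces with single spaces also trims and collapses extra whitespace
    return " ".join(" ".join(rows).split())


# Made-up sample: one missing row, plus a word that contains 'nan'
sample = pd.DataFrame(
    {"ocr_corrected_text": ["The  financial state", None, " of Denmark "]}
)
print(rows_to_text(sample))
```

By contrast, a blanket string replacement of 'nan' on the joined text would also alter any word that happens to contain those three letters.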

In [1]:
# Data manipulation and file operations
import pandas as pd  # For creating and managing DataFrames
import os  # For file and directory operations

# Progress tracking
from tqdm.notebook import tqdm  # Display progress bars in Jupyter notebooks
tqdm.pandas()

In [13]:
# ===================================================================
# BATCH PROCESSING: Process all csv files and store text as txt files
# ===================================================================

# Define input and output directories
departure_folder = r'.\literature csv_clean'     # Source folder containing clean CSV files
arrival_folder = r'.\literature txt'    # Destination folder for txt output

# Create the output directory if it doesn't exist
# exist_ok=True prevents errors if the directory already exists
os.makedirs(arrival_folder, exist_ok=True)

# Get a list of all files in the CSV folder
files_in_folder = os.listdir(departure_folder)

# Filter to only process CSV files and skip hidden files/folders
csv_files = [file for file in files_in_folder if file.endswith('.csv') and not file.startswith('.')]

print(f"Found {len(csv_files)} CSV files to process")

for file in csv_files:
    # Construct the full path to the CSV file
    file_path = os.path.join(departure_folder, file)
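    df = pd.read_csv(file_path)
    # Drop missing rows before joining: astype(str) would turn NaN into the
    # literal string 'nan', and replacing 'nan' afterwards would also damage
    # real words that contain it (e.g. 'financial')
    rows = df['ocr_corrected_text'].dropna().astype(str)
    _text = ' '.join(rows).strip()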
    
    # Create output filename by swapping the .csv extension for .txt
    # (splitext only touches the extension, not a '.csv' inside the name)
    output_filename = os.path.splitext(file)[0] + '.txt'
    output_path = os.path.join(arrival_folder, output_filename)
    
    # Save the text content to a txt file
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write(_text)
    
    print(f"Processed: {file} -> {output_filename}")

print(f"\nBatch processing complete! All files saved to: {arrival_folder}")


Found 159 CSV files to process
Processed: 1772 Travels through Holland Flanders Germany Denmark Sweden Lapland Russia the Ukraine and Poland in the years 1768 1769 and 1770 Vol 1.csv -> 1772 Travels through Holland Flanders Germany Denmark Sweden Lapland Russia the Ukraine and Poland in the years 1768 1769 and 1770 Vol 1.txt
Processed: 1773 Travels through Holland Flanders Germany Denmark Sweden Lapland Russia the Ukraine and Poland in the years 1768 1769 and 1770 Vol 3.csv -> 1773 Travels through Holland Flanders Germany Denmark Sweden Lapland Russia the Ukraine and Poland in the years 1768 1769 and 1770 Vol 3.txt
Processed: 1795 A tour through some of the northern parts of Europe.csv -> 1795 A tour through some of the northern parts of Europe.txt
Processed: 1805 A northern summer.csv -> 1805 A northern summer.txt
Processed: 1807 A Tour round the Baltic.csv -> 1807 A Tour round the Baltic.txt
Processed: 1807 A tour through part of Germany Poland Russia Sweden Denmark.csv -> 1807 A tou