Data Processing Scripts

A collection of Python scripts for data processing, transformation, and analysis. Handles CSV, JSON, and Excel files with ease.

✨ Features

  • πŸ“Š CSV data processing and transformation
  • πŸ“ˆ Excel file manipulation
  • πŸ”„ JSON data conversion
  • πŸ“‰ Data cleaning and validation
  • πŸ“¦ Batch processing support

πŸš€ Quick Start

Installation

git clone https://github.com/YOUR_USERNAME/data-processing-scripts.git
cd data-processing-scripts
pip install -r requirements.txt

Basic Usage

# Process CSV file
python csv_processor.py input.csv output.csv

# Clean data
python data_cleaner.py data.csv

# Convert formats
python format_converter.py data.json data.csv

πŸ“¦ Scripts

csv_processor.py

Process and transform CSV files with filtering, sorting, and aggregation.

python csv_processor.py input.csv output.csv --filter "age>25" --sort name

data_cleaner.py

Clean and validate data by removing duplicates and handling missing values.

python data_cleaner.py input.csv --remove-duplicates --fill-missing

format_converter.py

Convert between CSV, JSON, and Excel formats.

python format_converter.py input.csv output.json
python format_converter.py data.json data.xlsx

batch_processor.py

Process multiple files in batch mode.

python batch_processor.py --input-dir ./data --output-dir ./processed

πŸ“‚ Project Structure

data-processing-scripts/
β”œβ”€β”€ README.md              # Documentation
β”œβ”€β”€ requirements.txt       # Dependencies
β”œβ”€β”€ csv_processor.py       # CSV processing
β”œβ”€β”€ data_cleaner.py        # Data cleaning
β”œβ”€β”€ format_converter.py    # Format conversion
β”œβ”€β”€ batch_processor.py     # Batch processing
└── .gitignore            # Git ignore

πŸ”§ Advanced Usage

CSV Processing

from csv_processor import CSVProcessor

processor = CSVProcessor('data.csv')
processor.filter(lambda row: row['age'] > 25)
processor.sort_by('name')
processor.save('output.csv')
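The CSVProcessor source isn't included here, so this is a minimal stdlib-only sketch of what the API above could look like underneath. Note one caveat: `csv.DictReader` yields string values, so numeric filters need an explicit conversion such as `int(row['age'])`; the real script may handle type inference differently.

```python
import csv

class CSVProcessor:
    """Sketch of a chainable CSV processor backed by a list of dict rows."""

    def __init__(self, source):
        # Accept a file path, or an iterable of dict rows (handy for testing).
        if isinstance(source, str):
            with open(source, newline='', encoding='utf-8') as f:
                self.rows = list(csv.DictReader(f))
        else:
            self.rows = [dict(r) for r in source]

    def filter(self, predicate):
        # Keep only rows for which the predicate returns True.
        self.rows = [r for r in self.rows if predicate(r)]
        return self

    def sort_by(self, column):
        self.rows.sort(key=lambda r: r[column])
        return self

    def save(self, path):
        if not self.rows:
            return
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=self.rows[0].keys())
            writer.writeheader()
            writer.writerows(self.rows)
```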

Data Cleaning

from data_cleaner import DataCleaner

cleaner = DataCleaner('data.csv')
cleaner.remove_duplicates()
cleaner.fill_missing_values(method='mean')
cleaner.save('clean_data.csv')
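Likewise, a stdlib-only sketch of the DataCleaner API above. The mean-fill here only touches columns whose present values are all numeric; the real script's behavior may differ.

```python
import csv
from statistics import mean

class DataCleaner:
    """Sketch of a cleaner over dict rows: dedup and mean-fill."""

    def __init__(self, source):
        # Accept a file path, or an iterable of dict rows (handy for testing).
        if isinstance(source, str):
            with open(source, newline='', encoding='utf-8') as f:
                self.rows = list(csv.DictReader(f))
        else:
            self.rows = [dict(r) for r in source]

    def remove_duplicates(self):
        seen, unique = set(), []
        for r in self.rows:
            key = tuple(sorted(r.items()))
            if key not in seen:
                seen.add(key)
                unique.append(r)
        self.rows = unique
        return self

    def fill_missing_values(self, method='mean'):
        # Fill empty cells in fully numeric columns with the column mean.
        if method != 'mean' or not self.rows:
            return self
        for col in self.rows[0]:
            try:
                present = [float(r[col]) for r in self.rows if r[col] not in ('', None)]
            except ValueError:
                continue  # non-numeric column, leave untouched
            if present:
                fill = mean(present)
                for r in self.rows:
                    if r[col] in ('', None):
                        r[col] = fill
        return self

    def save(self, path):
        with open(path, 'w', newline='', encoding='utf-8') as f:
            writer = csv.DictWriter(f, fieldnames=self.rows[0].keys())
            writer.writeheader()
            writer.writerows(self.rows)
```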

Format Conversion

from format_converter import convert

convert('data.csv', 'data.json')
convert('data.json', 'data.xlsx')

πŸ“Š Feature Details

CSV Processor

  • Filter rows by conditions
  • Sort by columns
  • Aggregate data (sum, mean, count)
  • Column selection and renaming
  • Merge multiple CSV files

Data Cleaner

  • Remove duplicate rows
  • Handle missing values (fill, drop, interpolate)
  • Remove outliers
  • Standardize data formats
  • Validate data types

Format Converter

  • CSV ↔ JSON ↔ Excel
  • Preserves data types
  • Handles large files
  • Custom encoding support

Batch Processor

  • Process directory of files
  • Parallel processing support
  • Progress tracking
  • Error handling and logging

πŸ› οΈ Configuration

Create a config.json file:

{
  "encoding": "utf-8",
  "delimiter": ",",
  "chunk_size": 10000,
  "parallel_workers": 4
}

πŸ“ˆ Examples

Example 1: Filter and Sort

python csv_processor.py sales.csv filtered_sales.csv \
  --filter "revenue>1000" \
  --sort date \
  --columns date,customer,revenue

Example 2: Clean Data

python data_cleaner.py users.csv \
  --remove-duplicates \
  --fill-missing mean \
  --remove-outliers

Example 3: Batch Conversion

python batch_processor.py \
  --input-dir ./raw_data \
  --output-dir ./processed \
  --format json \
  --parallel

πŸ” Troubleshooting

Memory Issues

For large files, use chunking:

processor.process_chunks(chunk_size=10000)

Encoding Problems

Specify encoding:

processor = CSVProcessor('data.csv', encoding='latin-1')

Performance

Enable parallel processing:

python batch_processor.py --parallel --workers 8

πŸ“ Best Practices

  1. Validate input data before processing
  2. Use chunking for large files
  3. Enable logging for debugging
  4. Backup data before transformation
  5. Test on sample before full processing
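For practice 1, a lightweight validation pass can flag bad rows before any transformation runs. `validate_rows` is a hypothetical helper, not part of the scripts above; the schema maps column names to converters that raise on invalid input (e.g. `int`, `float`).

```python
def validate_rows(rows, schema):
    """Return indices of rows that are missing a column or fail a converter."""
    bad = []
    for i, row in enumerate(rows):
        for col, converter in schema.items():
            try:
                converter(row[col])
            except (KeyError, ValueError, TypeError):
                bad.append(i)
                break  # one failure is enough to flag the row
    return bad
```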

🀝 Contributing

  1. Fork the repository
  2. Create feature branch
  3. Commit changes
  4. Push to branch
  5. Open Pull Request

πŸ“„ License

MIT License - see LICENSE file


Made with ❀️ for data engineers
