A scalable, Hadoop-based solution for processing and analyzing healthcare ETL log files. This project transforms unstructured log data into a normalized SQL database, enabling efficient analysis of ETL processes, data transformations, and system patterns.
- Processes large-scale healthcare ETL log files using Hadoop/PySpark
- Extracts stored procedures and database operations from log text
- Creates normalized SQL mapping database for analysis
- Runs in containerized environment using Docker
- Enables tracking of data transformations and ETL patterns
- Data Analysis: Transform raw logs into structured, queryable data
- Pattern Recognition: Identify common ETL patterns and transformations
- Process Monitoring: Track and analyze ETL operations
- Scalability: Handle large volumes of log data efficiently
- Maintainability: Containerized solution for easy deployment
- **Python 3.11.x**
  - Download from Python.org
  - During installation:
    - Check "Add Python to PATH"
    - Choose "Customize installation"
    - Select all optional features
- **Docker Desktop**
  - Requirements:
    - Windows 10/11 Home, Pro, Enterprise, or Education
    - WSL 2 (Windows Subsystem for Linux)
    - 8 GB RAM minimum (16 GB recommended)
  - Installation:
    - Download Docker Desktop
    - Choose the AMD64 version for Intel/AMD processors
    - Follow the installation prompts
    - Accept the Docker Subscription Service Agreement (free for personal use)
- **WSL 2 Setup**

  ```powershell
  # Run in PowerShell as Administrator
  dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
  dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
  wsl --install -d Ubuntu
  wsl --set-default-version 2
  ```
- **VSCode Extensions**
  - Python
  - Docker
  - Remote Development
```
healthcare_hadoop/
├── data/                  # Directory for database files
├── src/
│   ├── config.py          # Configuration settings
│   ├── db_setup.py        # Database initialization
│   └── log_processor.py   # Main processing logic
├── Dockerfile             # Docker configuration
├── docker-compose.yml     # Multi-container Docker setup
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation
```
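`src/config.py` is described above as holding the configuration settings; a minimal sketch of what such a module might contain (all names and values here are illustrative assumptions, not taken from the project):

```python
# config.py -- central paths and settings (illustrative; adjust to your setup)
import os
from pathlib import Path

# Project root; overridable via an environment variable so the same code
# works both locally and inside the Docker container.
PROJECT_ROOT = Path(os.environ.get("PROJECT_ROOT", ".")).resolve()

DATA_DIR = PROJECT_ROOT / "data"          # database files live here
TARGET_DB_PATH = DATA_DIR / "target.db"   # normalized mapping database
LOG_DIR = DATA_DIR / "logs"               # raw ETL log files to process (assumed location)

# Spark settings used by the log processor
SPARK_APP_NAME = "healthcare-etl-log-processor"
SPARK_MASTER = os.environ.get("SPARK_MASTER", "local[*]")
```

Keeping every path relative to one overridable root avoids hard-coding host paths into the container image.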
- **Clone the Repository**

  ```shell
  git clone [your-repository-url]
  cd healthcare_hadoop
  ```

- **Set Up Python Environment**

  ```shell
  # Create virtual environment
  python -m venv venv

  # Activate virtual environment (on Windows)
  .\venv\Scripts\activate

  # Install dependencies
  pip install -r requirements.txt
  ```
- **Initialize Databases**

  ```shell
  python src/db_setup.py
  ```
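The initialization step creates the normalized mapping database. A minimal sketch of what `db_setup.py` could look like with SQLite (the table and column names are illustrative assumptions, not the project's actual schema):

```python
# db_setup.py -- initialize the normalized mapping database (illustrative schema)
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS stored_procedures (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS db_operations (
    id           INTEGER PRIMARY KEY,
    procedure_id INTEGER REFERENCES stored_procedures(id),
    operation    TEXT NOT NULL,      -- e.g. INSERT / UPDATE / DELETE / SELECT
    table_name   TEXT NOT NULL,
    logged_at    TEXT                -- timestamp parsed from the log line
);
"""

def init_db(path: str = "data/target.db") -> None:
    """Create the mapping tables if they do not already exist."""
    with sqlite3.connect(path) as conn:
        conn.executescript(SCHEMA)

if __name__ == "__main__":
    init_db()
```

`CREATE TABLE IF NOT EXISTS` makes the script safe to re-run, which matters when the container restarts.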
- **Docker Setup**

  ```shell
  # Build and start containers
  docker-compose up -d
  ```
- **Process Log Files**

  ```shell
  python src/log_processor.py
  ```
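Per the overview, `log_processor.py` uses Hadoop/PySpark to scale the work, but the core extraction step comes down to matching patterns over log lines. A plain-Python sketch of that step (the log format and regex are illustrative assumptions; real healthcare ETL logs will need project-specific patterns):

```python
# Extract stored-procedure calls from ETL log lines (illustrative pattern).
import re

# Matches lines like "2024-01-15 02:10:33 INFO EXEC dbo.usp_LoadPatientClaims"
PROC_PATTERN = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*?"
    r"EXEC\s+(?P<proc>[\w\.\[\]]+)",
    re.IGNORECASE,
)

def extract_procs(lines):
    """Yield (timestamp, procedure_name) tuples found in log lines."""
    for line in lines:
        m = PROC_PATTERN.search(line)
        if m:
            yield m.group("ts"), m.group("proc")

sample = [
    "2024-01-15 02:10:33 INFO EXEC dbo.usp_LoadPatientClaims",
    "2024-01-15 02:11:02 INFO rows affected: 14210",
]
print(list(extract_procs(sample)))
# → [('2024-01-15 02:10:33', 'dbo.usp_LoadPatientClaims')]
```

In the PySpark version, a function like `extract_procs` would be applied per partition so each executor parses its own slice of the log files.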
- **View Results**
  - Check the mapping database in `data/target.db`
  - Use SQL queries to analyze extracted data
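A quick analysis query against the mapping database might look like this (the schema here is an illustrative assumption; match the table and column names to whatever `db_setup.py` actually creates):

```python
# Summarize operations per stored procedure in the mapping database.
import sqlite3
from pathlib import Path

QUERY = """
SELECT p.name, COUNT(*) AS op_count
FROM db_operations o
JOIN stored_procedures p ON p.id = o.procedure_id
GROUP BY p.name
ORDER BY op_count DESC;
"""

db_path = Path("data/target.db")
if db_path.exists():  # only query once the database has been built
    with sqlite3.connect(db_path) as conn:
        for name, count in conn.execute(QUERY):
            print(f"{name}: {count}")
```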
- Use VSCode with Python extension
- Set up Python interpreter from virtual environment
- Use integrated terminal for commands
- Debug with VSCode's debugging tools
- Build the development container: `docker-compose -f docker-compose.dev.yml build`
- Start the development environment: `docker-compose -f docker-compose.dev.yml up`
- **Docker Desktop Not Starting**
  - Verify WSL 2 installation
  - Check system requirements
  - Run as administrator
  - Restart the Docker service
- **Python Environment Issues**
  - Verify Python installation
  - Check the PATH environment variable
  - Recreate the virtual environment
- **Database Connection Issues**
  - Check database files in the data directory
  - Verify permissions
  - Check connection strings in config.py
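If those checks are inconclusive, a short script can confirm that the database file opens and show which tables exist (the `data/target.db` path matches the project layout above; adjust it if your config differs):

```python
# Sanity check: open the mapping database and list its tables.
import sqlite3
from pathlib import Path

db_path = Path("data/target.db")
if not db_path.exists():
    print(f"missing database file: {db_path}")
else:
    with sqlite3.connect(db_path) as conn:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )]
        print("tables:", tables or "(none -- run src/db_setup.py)")
```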
- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Thanks to the Apache Hadoop and Spark communities
- Healthcare data warehouse team for project requirements
- Contributors and maintainers