Healthcare Log Processing with Hadoop

Project Overview

A scalable, Hadoop-based solution for processing and analyzing healthcare ETL log files. This project transforms unstructured log data into a normalized SQL database, enabling efficient analysis of ETL processes, data transformations, and system patterns.

Key Features

- Processes large-scale healthcare ETL log files using Hadoop/PySpark
- Extracts stored procedures and database operations from log text
- Creates a normalized SQL mapping database for analysis
- Runs in a containerized environment using Docker
- Enables tracking of data transformations and ETL patterns
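
As a rough illustration of the extraction feature, the sketch below pulls stored procedure calls and SQL operations out of a single log line with regular expressions. The log line format, the `extract_operations` helper, and both patterns are assumptions made for this example; the project's own parsing lives in src/log_processor.py and may work differently.

```python
import re

# Hypothetical patterns; real ETL logs may need different ones.
PROC_PATTERN = re.compile(r"\b(?:EXEC|EXECUTE|CALL)\s+([\w.\[\]]+)", re.IGNORECASE)
OP_PATTERN = re.compile(
    r"\b(INSERT\s+INTO|UPDATE|DELETE\s+FROM|MERGE\s+INTO|TRUNCATE\s+TABLE)\s+([\w.\[\]]+)",
    re.IGNORECASE,
)


def extract_operations(line: str) -> dict:
    """Return stored procedure names and (operation, table) pairs found in one log line."""
    procedures = PROC_PATTERN.findall(line)
    operations = [
        # Normalize case and inner whitespace, e.g. "insert  into" -> "INSERT INTO".
        (" ".join(op.upper().split()), table)
        for op, table in OP_PATTERN.findall(line)
    ]
    return {"procedures": procedures, "operations": operations}


# Made-up log line, used only to exercise the helper.
sample = "2024-01-15 02:00:03 INFO EXEC dbo.usp_LoadPatients; INSERT INTO stg.Patients SELECT ..."
print(extract_operations(sample))
```

Keeping the patterns in module-level constants makes them easy to adjust once the real log format is known.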

Business Value

- Data Analysis: Transform raw logs into structured, queryable data
- Pattern Recognition: Identify common ETL patterns and transformations
- Process Monitoring: Track and analyze ETL operations
- Scalability: Handle large volumes of log data efficiently
- Maintainability: Containerized solution for easy deployment

Prerequisites

Required Software

Python 3.11.x
- Download from Python.org
- During installation:
  - Check "Add Python to PATH"
  - Choose "Customize installation"
  - Select all optional features
Docker Desktop
- Requirements:
  - Windows 10/11 Home, Pro, Enterprise, or Education
  - WSL 2 (Windows Subsystem for Linux)
  - 8GB RAM minimum (16GB recommended)
- Installation:
  - Download Docker Desktop
  - Choose AMD64 version for Intel/AMD processors
  - Follow installation prompts
  - Accept Docker Subscription Service Agreement (free for personal use)
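
To confirm the prerequisites are in place before continuing, a small check like the one below can help. It is a sketch only: the `check_prerequisites` helper is hypothetical, it treats any Python at or above 3.11 as acceptable (the README itself asks for 3.11.x), and it only tests whether the `docker` and `wsl` executables are on PATH, not whether they work.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 11)) -> dict:
    """Report whether the interpreter version and required CLI tools are available."""
    return {
        "python_ok": sys.version_info[:2] >= min_python,
        "docker_on_path": shutil.which("docker") is not None,
        "wsl_on_path": shutil.which("wsl") is not None,  # only relevant on Windows
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'yes' if ok else 'NO'}")
```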

WSL 2 Setup

```
# Run in PowerShell as Administrator
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

# Restart Windows so the features take effect, then:
wsl --set-default-version 2
wsl --install -d Ubuntu
```

VSCode Extensions
- Python
- Docker
- Remote Development

Project Structure

```
healthcare_hadoop/
├── data/                  # Directory for database files
├── src/
│   ├── config.py          # Configuration settings
│   ├── db_setup.py        # Database initialization
│   └── log_processor.py   # Main processing logic
├── Dockerfile             # Docker configuration
├── docker-compose.yml     # Multi-container Docker setup
├── requirements.txt       # Python dependencies
└── README.md              # Project documentation
```

Installation & Setup

Clone the Repository

```
git clone [your-repository-url]
cd healthcare_hadoop
```

Set Up Python Environment

```
# Create virtual environment
python -m venv venv

# Activate virtual environment
# On Windows:
.\venv\Scripts\activate
# On Linux, macOS, or inside WSL:
# source venv/bin/activate

# Install dependencies
pip install -r requirements.txt
```

Initialize Databases
```
python src/db_setup.py
```
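
The README does not describe the layout of the mapping database, so the following is only a sketch of what a normalized schema could look like. It assumes SQLite (suggested by the data/target.db file name), and every table and column name here is hypothetical rather than taken from src/db_setup.py.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stored_procedure (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE IF NOT EXISTS log_event (
    id           INTEGER PRIMARY KEY,
    logged_at    TEXT,
    procedure_id INTEGER REFERENCES stored_procedure(id),
    operation    TEXT,
    target_table TEXT
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the tables if they do not exist and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    conn.commit()
    return conn


conn = init_db()  # pass "data/target.db" to write to disk instead
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)
```

Storing each procedure name once and referencing it by id from the event table is what makes the layout normalized: repeated calls to the same procedure do not repeat its name.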

Docker Setup

```
# Build and start containers
docker-compose up -d
```

Usage

Process Log Files
```
python src/log_processor.py
```
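
To give a feel for the kind of per-line work the processing step involves, here is a plain-Python stand-in that tallies SQL operation keywords across log lines. It is not the project's PySpark implementation; the keyword list, the `count_operations` helper, and the demo lines are all assumptions made for illustration.

```python
import re
from collections import Counter
from typing import Iterable

# Hypothetical keyword list; adjust to the statements your logs actually contain.
OPERATION = re.compile(r"\b(EXEC|INSERT|UPDATE|DELETE|MERGE|TRUNCATE)\b", re.IGNORECASE)


def count_operations(lines: Iterable[str]) -> Counter:
    """Tally SQL operation keywords across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(match.upper() for match in OPERATION.findall(line))
    return counts


# Works on any iterable of strings, including an open file handle:
# with open("path/to/etl.log", encoding="utf-8") as handle:
#     print(count_operations(handle).most_common(5))

# Made-up lines, used only to exercise the helper.
demo_lines = [
    "02:00:03 INFO EXEC dbo.usp_LoadPatients",
    "02:00:07 INFO insert into stg.Patients ...",
    "02:00:09 INFO EXEC dbo.usp_LoadVisits",
]
print(count_operations(demo_lines))
```

Because the function accepts any iterable, a large file is streamed line by line rather than loaded into memory.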
View Results
- Inspect the mapping database at data/target.db
- Run SQL queries against it to analyze the extracted data
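
Since the table names in the mapping database are not documented here, a reasonable first query is simply to list them. The sketch below assumes data/target.db is a SQLite file; `list_tables` is a hypothetical helper, not part of the project.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def list_tables(db_path: str) -> list[str]:
    """Return the names of the tables in a SQLite database file."""
    # Read-only URI mode: a wrong path raises an error instead of creating an empty file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    with closing(sqlite3.connect(uri, uri=True)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    db_file = "data/target.db"  # path taken from this README
    if Path(db_file).exists():
        print(list_tables(db_file))
    else:
        print(f"{db_file} not found; run the processing step first")
```

Once the table names are known, ordinary `SELECT` statements can be run through the same connection pattern.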

Development

Local Development

- Use VSCode with the Python extension
- Select the Python interpreter from the virtual environment
- Use the integrated terminal for commands
- Debug with VSCode's debugging tools

Docker Development

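Build the development container (this expects a docker-compose.dev.yml file, which is not listed in the project structure above):

```
docker-compose -f docker-compose.dev.yml build
```

Start the development environment:

```
docker-compose -f docker-compose.dev.yml up
```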

Troubleshooting

Common Issues

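Docker Desktop Not Starting
- Verify that WSL 2 is installed (wsl --status)
- Check the system requirements listed under Prerequisites
- Run Docker Desktop as administrator
- Restart the Docker Desktop service
Python Environment Issues
- Verify the Python installation (python --version)
- Check that Python is on the PATH environment variable
- Delete and recreate the virtual environment
Database Connection Issues
- Check that the database files exist in the data directory
- Verify read/write permissions on those files
- Check the connection strings in src/config.py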
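
The database checks above can be partly automated. The following sketch assumes the database is a SQLite file and uses a hypothetical `diagnose_db` helper; it tests for the file's existence, read/write access for the current user, and SQLite's own integrity check.

```python
import os
import sqlite3
from contextlib import closing


def diagnose_db(path: str) -> str:
    """Return a short, human-readable status for a SQLite database file."""
    if not os.path.isfile(path):
        return "missing: file does not exist"
    if not os.access(path, os.R_OK | os.W_OK):
        return "permissions: file is not readable and writable by this user"
    try:
        with closing(sqlite3.connect(path)) as conn:
            (result,) = conn.execute("PRAGMA integrity_check").fetchone()
    except sqlite3.DatabaseError as exc:
        return f"unreadable: {exc}"
    return "ok" if result == "ok" else f"corrupt: {result}"


print(diagnose_db("data/target.db"))  # path taken from this README
```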

Contributing

- Fork the repository
- Create your feature branch
- Commit your changes
- Push to the branch
- Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

- Thanks to the Apache Hadoop and Spark communities
- Thanks to the healthcare data warehouse team for the project requirements
- Thanks to all contributors and maintainers
