# 🗄️ Hadoop HDFS Basics

This notebook demonstrates basic HDFS operations in the Big Data environment.

## Learning Objectives
- Connect to HDFS
- Perform basic file operations
- Upload and download files
- Explore HDFS directory structure

## 1. Environment Setup

In [None]:
import os
import subprocess
import pandas as pd
from hdfs import InsecureClient

# HDFS Configuration
HDFS_URL = 'http://namenode:9870'
HDFS_USER = 'root'

print('🚀 Big Data Environment - Hadoop HDFS Basics')
print('=' * 50)
print(f'HDFS URL: {HDFS_URL}')
print(f'User: {HDFS_USER}')
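
The settings above are hard-coded for this environment. A minimal sketch of reading them from environment variables instead, falling back to the notebook's values; the helper `load_hdfs_config` and the use of `HDFS_URL` and `HDFS_USER` as environment variable names are assumptions for illustration, not something the environment is known to define.

```python
import os


def load_hdfs_config(env=None):
    """Return HDFS settings, preferring environment variables when present."""
    # Hypothetical helper: the variable names are an assumption.
    env = os.environ if env is None else env
    return {
        'url': env.get('HDFS_URL', 'http://namenode:9870'),
        'user': env.get('HDFS_USER', 'root'),
    }


config = load_hdfs_config()
print(f"HDFS URL: {config['url']}")
print(f"User: {config['user']}")
```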

## 2. Connect to HDFS

In [None]:
# Create HDFS client
try:
    client = InsecureClient(HDFS_URL, user=HDFS_USER)

    # Creating the client sends no request, so list the root directory
    # to confirm that the NameNode is actually reachable
    root_files = client.list('/')
    print('✅ Successfully connected to HDFS')
    print(f'📁 Root directory contains: {root_files}')

except Exception as e:
    print(f'❌ Failed to connect to HDFS: {e}')
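
`InsecureClient` talks to the NameNode over the WebHDFS REST API, so a call such as `client.list('/')` corresponds to an HTTP GET with `op=LISTSTATUS`. The sketch below shows roughly how such a request URL is put together. The helper `webhdfs_url` is hypothetical and is not part of the `hdfs` package; it only illustrates the URL layout.

```python
from urllib.parse import quote, urlencode


def webhdfs_url(base_url, path, op, user=None):
    """Build a WebHDFS request URL for an absolute HDFS path (illustration only)."""
    params = {'op': op}
    if user:
        # Without Kerberos, the user is passed as the user.name query parameter
        params['user.name'] = user
    return f"{base_url.rstrip('/')}/webhdfs/v1{quote(path)}?{urlencode(params)}"


print(webhdfs_url('http://namenode:9870', '/', 'LISTSTATUS', user='root'))
```

Opening the printed URL with any HTTP client from inside the environment is a quick way to check connectivity independently of the Python library.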

## 3. Basic HDFS Operations

In [None]:
# Create directories
print('📁 Creating HDFS directories...')

directories = [
    '/user/demo',
    '/user/demo/input',
    '/user/demo/output',
    '/user/demo/processed'
]

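for directory in directories:
    try:
        # makedirs is expected to succeed when the directory already exists
        # (like `mkdir -p`), so an exception here points to a real problem
        client.makedirs(directory)
        print(f'✅ Directory ready: {directory}')
    except Exception as e:
        print(f'❌ Could not create {directory}: {e}')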
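
To tell a newly created directory apart from one that was already there, the path can be checked first. A minimal sketch, assuming the client's `status` method accepts `strict=False` and returns `None` for a missing path; `ensure_dir` is a hypothetical helper name.

```python
def ensure_dir(client, path):
    """Create `path` if it is missing; return True only when it was created."""
    # strict=False asks for None instead of an exception on a missing path
    if client.status(path, strict=False) is not None:
        return False
    client.makedirs(path)
    return True
```

With the notebook's client, `ensure_dir(client, '/user/demo/input')` would return `False` on a second run. The check and the creation are two separate requests, so this is not safe against concurrent writers.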

In [None]:
# List directory contents
print('📋 Listing HDFS directories:')
print('\n/user directory:')
try:
    user_files = client.list('/user', status=True)
    for item in user_files:
        file_type = 'DIR' if item[1]['type'] == 'DIRECTORY' else 'FILE'
        size = item[1]['length']
        print(f'  {file_type:4} {size:>10} {item[0]}')
except Exception as e:
    print(f'❌ Error listing directory: {e}')
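
Raw byte counts become hard to read for larger files. The sketch below formats the `(name, status)` pairs returned by `client.list(path, status=True)` with human-readable sizes. The helpers `human_size` and `format_listing` are illustrative names, and the sample entries are made up.

```python
def human_size(num_bytes):
    """Format a byte count using binary units (KiB, MiB, ...)."""
    if num_bytes < 1024:
        return f'{num_bytes} B'
    size = float(num_bytes)
    for unit in ('KiB', 'MiB', 'GiB', 'TiB'):
        size /= 1024
        if size < 1024 or unit == 'TiB':
            return f'{size:.1f} {unit}'


def format_listing(entries):
    """Turn (name, status) pairs into aligned text rows."""
    rows = []
    for name, status in entries:
        kind = 'DIR' if status['type'] == 'DIRECTORY' else 'FILE'
        rows.append(f"{kind:4} {human_size(status['length']):>10} {name}")
    return rows


# Made-up entries in the same shape as client.list('/user', status=True)
sample = [
    ('demo', {'type': 'DIRECTORY', 'length': 0}),
    ('users.csv', {'type': 'FILE', 'length': 2048}),
]
for row in format_listing(sample):
    print(row)
```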

## 4. Upload Files to HDFS

In [None]:
# Upload sample data files to HDFS
print('📤 Uploading files to HDFS...')

local_files = {
    '/home/jovyan/data/users.csv': '/user/demo/input/users.csv',
    '/home/jovyan/data/transactions.json': '/user/demo/input/transactions.json',
    '/home/jovyan/data/logs.txt': '/user/demo/input/logs.txt'
}

for local_path, hdfs_path in local_files.items():
    try:
        if os.path.exists(local_path):
            client.upload(hdfs_path, local_path, overwrite=True)
            print(f'✅ Uploaded: {local_path} → {hdfs_path}')
        else:
            print(f'⚠️  File not found: {local_path}')
    except Exception as e:
        print(f'❌ Error uploading {local_path}: {e}')
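
A quick sanity check after an upload is to compare the local file size with the length that HDFS reports. A minimal sketch, assuming `client.status(path, strict=False)` returns `None` for a missing path; `upload_matches` is a hypothetical helper.

```python
import os


def upload_matches(client, local_path, hdfs_path):
    """Return True if the HDFS file exists and has the same size as the local file."""
    local_size = os.path.getsize(local_path)
    status = client.status(hdfs_path, strict=False)
    return status is not None and status['length'] == local_size
```

Matching sizes do not prove that the contents are identical, but a mismatch reliably flags a truncated or stale upload.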

In [None]:
# Verify uploaded files
from datetime import datetime, timezone

print('🔍 Verifying uploaded files:')
try:
    input_files = client.list('/user/demo/input', status=True)
    for item in input_files:
        file_type = 'DIR' if item[1]['type'] == 'DIRECTORY' else 'FILE'
        size = item[1]['length']
        # modificationTime is reported in milliseconds since the Unix epoch
        modified = datetime.fromtimestamp(item[1]['modificationTime'] / 1000, tz=timezone.utc)
        print(f'  {file_type:4} {size:>8} bytes {modified:%Y-%m-%d %H:%M} UTC {item[0]}')
except Exception as e:
    print(f'❌ Error listing files: {e}')

## 5. Read Files from HDFS

In [None]:
# Read CSV file from HDFS
print('📖 Reading users.csv from HDFS:')
try:
    with client.read('/user/demo/input/users.csv') as reader:
        users_data = reader.read().decode('utf-8')
    
    # Display the first few lines
    lines = users_data.splitlines()
    for i, line in enumerate(lines[:6], start=1):
        if line.strip():
            print(f'  {i:2}: {line}')
    print(f'  ... ({len(lines)} total lines)')
    
except Exception as e:
    print(f'❌ Error reading file: {e}')
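
The cell above loads the whole file into memory before printing a few lines. For large files, a preview can stop after the first lines instead. A minimal sketch, assuming `client.read` accepts `encoding` and `delimiter` and then yields one decoded line at a time; `preview_lines` is a hypothetical helper.

```python
from itertools import islice


def preview_lines(client, hdfs_path, n=5):
    """Return up to `n` lines from the start of an HDFS text file."""
    with client.read(hdfs_path, encoding='utf-8', delimiter='\n') as reader:
        # islice stops pulling from the reader after n lines
        return [line.rstrip('\r') for line in islice(reader, n)]
```

With the notebook's client, `preview_lines(client, '/user/demo/input/users.csv', n=6)` would return the header and the first five records.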

In [None]:
# Read and parse CSV using pandas
print('📊 Processing CSV data with pandas:')
try:
    with client.read('/user/demo/input/users.csv') as reader:
        df_users = pd.read_csv(reader)
    
    print(f'✅ Loaded {len(df_users)} users')
    print('\nFirst 5 records:')
    print(df_users.head())
    
    print('\nData summary:')
    print(f'  - Total users: {len(df_users)}')
    print(f'  - Countries: {df_users["country"].nunique()}')
    print(f'  - Average age: {df_users["age"].mean():.1f} years')
    
except Exception as e:
    print(f'❌ Error processing CSV: {e}')

## 6. HDFS File Operations

In [None]:
# Create a processed file and upload to HDFS
print('🔄 Processing data and saving to HDFS:')
try:
    # Process the users data
    country_summary = df_users.groupby('country').agg({
        'user_id': 'count',
        'age': 'mean'
    }).rename(columns={'user_id': 'user_count', 'age': 'avg_age'})
    
    # Save processed data to local file first
    local_processed_file = '/tmp/country_summary.csv'
    country_summary.to_csv(local_processed_file)
    
    # Upload to HDFS
    hdfs_processed_file = '/user/demo/processed/country_summary.csv'
    client.upload(hdfs_processed_file, local_processed_file, overwrite=True)
    
    print('✅ Processed data saved to HDFS')
    print('Country Summary:')
    print(country_summary)
    
except Exception as e:
    print(f'❌ Error processing data: {e}')
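
The cell above writes a temporary local file and then uploads it. For small results, the text can also be written to HDFS directly. A minimal sketch, assuming `client.write` accepts `data`, `encoding`, and `overwrite` keyword arguments; `save_text` is a hypothetical helper.

```python
def save_text(client, hdfs_path, text):
    """Write `text` to `hdfs_path` as UTF-8 and return the number of bytes sent."""
    client.write(hdfs_path, data=text, encoding='utf-8', overwrite=True)
    return len(text.encode('utf-8'))
```

Since `DataFrame.to_csv()` returns a string when called without a path, the summary could be saved with `save_text(client, '/user/demo/processed/country_summary.csv', country_summary.to_csv())`. The temporary-file approach remains the better fit for results too large to hold in memory as one string.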

In [None]:
# Get file information
print('ℹ️  File information:')
try:
    files_to_check = [
        '/user/demo/input/users.csv',
        '/user/demo/input/transactions.json',
        '/user/demo/processed/country_summary.csv'
    ]
    
    for file_path in files_to_check:
        try:
            status = client.status(file_path)
            print(f'\n📄 {file_path}:')
            print(f'   Size: {status["length"]} bytes')
            print(f'   Type: {status["type"]}')
            print(f'   Replication: {status["replication"]}')
            print(f'   Block Size: {status["blockSize"]} bytes')
        except Exception as e:
            print(f'❌ Could not get status for {file_path}: {e}')
            
except Exception as e:
    print(f'❌ Error getting file info: {e}')
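
The `length` and `blockSize` fields together determine how many blocks a file occupies: HDFS splits a file into blocks of at most `blockSize` bytes, and only the last block may be smaller. A short sketch of that arithmetic; `block_count` is an illustrative helper and the sizes in the example are made up.

```python
def block_count(length, block_size):
    """Number of HDFS blocks needed for a file of `length` bytes."""
    if block_size <= 0:
        raise ValueError('block_size must be positive')
    # Integer ceiling division; an empty file occupies no blocks
    return (length + block_size - 1) // block_size


MIB = 1024 * 1024
print(block_count(300 * MIB, 128 * MIB))  # 3 blocks: 128 + 128 + 44 MiB
print(block_count(10 * 1024, 128 * MIB))  # 1 block, even for a small file
```

With the notebook's client, `block_count(status['length'], status['blockSize'])` gives the figure for a real file.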

## 7. HDFS Administration Commands

In [None]:
# Check HDFS disk usage
print('💾 HDFS Disk Usage:')
try:
    # Use client.content to get a content summary for the directory tree
    content_summary = client.content('/user/demo')
    print('Directory: /user/demo')
    print(f'  Files: {content_summary["fileCount"]}')
    print(f'  Directories: {content_summary["directoryCount"]}')
    print(f'  Size: {content_summary["length"]} bytes')
    print(f'  Space Consumed: {content_summary["spaceConsumed"]} bytes')
    
except Exception as e:
    print(f'❌ Error getting disk usage: {e}')
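
In the content summary, `length` is the logical size of the data, while `spaceConsumed` is the disk space used across the cluster, which should include replicas. Their ratio therefore approximates the effective replication factor. A minimal sketch under that assumption; `replication_overhead` is a hypothetical helper and the sample summary is made up.

```python
def replication_overhead(summary):
    """Return spaceConsumed / length, or None for an empty directory tree."""
    length = summary['length']
    if length == 0:
        return None
    return summary['spaceConsumed'] / length


# Made-up summary in the same shape as client.content('/user/demo')
sample = {'length': 1000, 'spaceConsumed': 3000}
print(replication_overhead(sample))
```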

## 8. Cleanup (Optional)

In [None]:
# Optional: Clean up created files and directories
# Uncomment the following lines if you want to clean up

# print('🧹 Cleaning up HDFS files...')
# try:
#     client.delete('/user/demo', recursive=True)
#     print('✅ Cleanup completed')
# except Exception as e:
#     print(f'❌ Error during cleanup: {e}')

print('💡 To clean up, uncomment and run the cleanup code above')
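
Recursive deletes are easy to get wrong, so it can help to restrict them to the demo directory. A minimal sketch with a hypothetical `safe_delete` helper that refuses any path outside `/user/demo`; it assumes `client.delete` accepts `recursive=True` and returns a boolean.

```python
import posixpath


def safe_delete(client, hdfs_path, allowed_root='/user/demo'):
    """Recursively delete `hdfs_path`, but only inside `allowed_root`."""
    # Normalize first so that '..' segments cannot escape the allowed root
    normalized = posixpath.normpath(hdfs_path)
    inside = normalized == allowed_root or normalized.startswith(allowed_root + '/')
    if not inside:
        raise ValueError(f'Refusing to delete outside {allowed_root}: {hdfs_path}')
    return client.delete(normalized, recursive=True)
```

With the notebook's client, `safe_delete(client, '/user/demo/processed')` removes only the processed output, while `safe_delete(client, '/user')` raises a `ValueError` before any request is sent.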

## 🎯 Summary

In this notebook, you learned:

1. **HDFS Connection**: How to connect to HDFS using Python
2. **Directory Operations**: Creating and listing directories
3. **File Upload**: Uploading local files to HDFS
4. **File Reading**: Reading and processing files from HDFS
5. **Data Processing**: Processing data and saving results back to HDFS
6. **File Management**: Getting file information and disk usage

### Next Steps
- Explore the **02-spark-intro.ipynb** notebook to learn Spark basics
- Check out the Hadoop NameNode UI at http://localhost:9870
- Browse HDFS files through the web interface

### 🔗 Useful Links
- **NameNode UI**: http://localhost:9870
- **DataNode UI**: http://localhost:9864
- **HDFS Documentation**: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html