# OOP in Python
- Data engineers build reusable tools: readers, transformers, loggers.
- OOP helps us modularize, scale, and reorganize pipeline steps cleanly.
- Instead of repeating code for each file format or API, define a class once and use it everywhere.

## Classes and Objects (Core Foundation)
### What are Classes & Objects?
- Class: A blueprint for creating objects (defines attributes & methods).
- Object: An instance of a class (contains real data & behavior).
### Key Concepts:
- `__init__`: Constructor (initializes object attributes).

- Methods: Functions defined inside a class (actions an object can perform).

- Instance Variables: Data unique to each object.

### Why is it Important in Data Engineering?
- Modularity: Breaks pipelines into reusable components (e.g., Extract, Transform, Load).

- Abstraction: Hides complex logic behind simple method calls (e.g., execute()).

- Scalability: Easily add new steps without rewriting existing code.

In [0]:
class Customer:
    def __init__(self, name):
        self.name = name  # Instance variable set by the constructor

c1 = Customer("Gourav")
print(c1.name)  # Output: Gourav

In [0]:
class PipelineStep:
    def __init__(self, step_name):  # Constructor
        self.step_name = step_name  # Instance variable (unique per object)

    def execute(self):  # Method (action)
        print("Executing step: " + self.step_name)

# Objects (instances of PipelineStep)
ingest = PipelineStep("Ingestion")  
transform = PipelineStep("Transformation")  

ingest.execute()    # Output: "Executing step: Ingestion"
transform.execute() # Output: "Executing step: Transformation"
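
The scalability point above can be sketched as follows. This is a minimal illustration, not a production pattern: `run_pipeline` is a hypothetical helper that is not part of the original notebook, and `execute()` here also returns its message so the results can be collected.

```python
class PipelineStep:
    """One named step in a pipeline (same idea as the class in the cell above)."""

    def __init__(self, step_name):
        self.step_name = step_name  # Instance variable, unique per object

    def execute(self):
        message = "Executing step: " + self.step_name
        print(message)
        return message  # Returned as well so callers can collect results


def run_pipeline(steps):
    """Hypothetical helper: execute each step in order and collect the messages."""
    return [step.execute() for step in steps]


steps = [PipelineStep("Ingestion"), PipelineStep("Transformation")]
steps.append(PipelineStep("Load"))  # Adding a step needs no change to run_pipeline
results = run_pipeline(steps)
```

Because every step exposes the same `execute()` method, a new step is just one more object in the list.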

## Encapsulation – Protect Internal Logic

### What is Encapsulation?
- Bundling data (attributes) and methods (functions) into a single unit (class).
- Restricting direct access to sensitive data (data hiding).
- Improves maintainability and gives controlled access to internal state. In Python this relies on naming conventions rather than enforced access control.
### Why is it Important in Data Engineering?
- Keeps sensitive values (e.g., database credentials, API keys) out of a class's public interface.
- Hides complex logic (e.g., ETL transformations, connection handling).
- Improves modularity: changes inside a class don’t affect other code.

In [0]:
class DatabaseConnector:
    def __init__(self):
        # The underscore prefix marks _credentials as protected by convention:
        # Python does not enforce it, but callers should not access it directly.
        self._credentials = "user:pass@123"

    def connect(self):
        print("Connecting to database using stored credentials...")
        return "DB connection established"

db = DatabaseConnector()
# print(db._credentials)  # Works, but breaks the "protected" convention
print(db.connect())
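
Controlled access can be taken one step further with a read-only property. The sketch below is only an illustration: the `user` and `password` parameters and the `masked_credentials` property are assumptions made for this example. Note that the double-underscore prefix triggers name mangling, which discourages accidental access but is not a security mechanism.

```python
class DatabaseConnector:
    def __init__(self, user, password):
        self._user = user           # Single underscore: internal by convention
        self.__password = password  # Double underscore: name-mangled by Python

    @property
    def masked_credentials(self):
        """Read-only view that never exposes the real password."""
        return f"{self._user}:{'*' * len(self.__password)}"

    def connect(self):
        # Internal details stay inside the class; callers only see the result.
        return f"DB connection established for {self._user}"


db = DatabaseConnector("etl_user", "pass@123")
print(db.masked_credentials)  # Output: etl_user:********
print(db.connect())           # Output: DB connection established for etl_user
```

Callers get a safe summary through the property, while the raw password stays behind the class boundary.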

## Inheritance in Python
### What is Inheritance?
- Reuse & Extend: A child class inherits attributes/methods from a parent class.
- Method Overriding: Child classes can modify inherited methods.
- Hierarchy: Creates logical relationships (e.g., BaseReader → CSVReader, APIReader).
### Why is it Important in Data Engineering?
- Avoids Code Duplication: Shared logic (e.g., read()) in a base class.
- Standardizes Interfaces: All readers expose the same read() method.
- Extensibility: Add new readers (e.g., DBReader) without changing existing code.

In [0]:
# Example: Reader Class Hierarchy
class Reader():  # Parent class (Base)
    def read(self):
        return "Reading from Base Reader..."  # Default implementation

class CSVReader(Reader):  # Child class (inherits from Reader)
    def read(self):  # Method overriding
        return "📄 Reading from CSV"

class APIReader(Reader):  # Child class
    def read(self):  # Method overriding
        return "🌐 Fetching from API"

# Usage
print(CSVReader().read())  # Output: "📄 Reading from CSV"  
print(APIReader().read())  # Output: "🌐 Fetching from API"  

In [0]:
# Real-World Use Case:
class DataSource:  # Base class
    def extract(self):
        raise NotImplementedError("Child classes must implement this!")

class BigQuerySource(DataSource):
    def extract(self):
        return "Extracting from BigQuery..."

class S3Source(DataSource):
    def extract(self):
        return "Loading from S3..."
    
source = BigQuerySource()
print(source.extract())     # Output: Extracting from BigQuery...
s3_source = S3Source()
print(s3_source.extract())  # Output: Loading from S3...
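
The "avoids code duplication" point can be sketched by keeping shared logic in the parent and letting children supply only the format-specific part. This is a minimal sketch under assumptions: `BaseReader`, the `_fetch()` hook, the `timeout` parameter, and the simulated rows are all hypothetical, and no file or network access takes place.

```python
class BaseReader:
    """Parent class: holds the logic every reader shares."""

    def __init__(self, source):
        self.source = source

    def _fetch(self):
        # Children override this hook with format-specific logic.
        raise NotImplementedError("Child classes must implement _fetch()")

    def read(self):
        # Shared logic lives here once: fetch, then drop empty rows.
        rows = self._fetch()
        return [row for row in rows if row]


class CSVReader(BaseReader):
    def _fetch(self):
        # Simulated CSV content; a real reader would open self.source.
        return ["id,name", "", "1,Gourav"]


class APIReader(BaseReader):
    def __init__(self, source, timeout=30):
        super().__init__(source)  # Reuse the parent's initialization
        self.timeout = timeout    # Then extend it with a child-only attribute

    def _fetch(self):
        # Simulated API payload; no network call is made.
        return ['{"id": 1}', ""]


print(CSVReader("data.csv").read())     # Output: ['id,name', '1,Gourav']
print(APIReader("orders_api").read())   # Output: ['{"id": 1}']
```

The empty-row filter is written once in `BaseReader.read()`, and both children inherit it unchanged.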

## Polymorphism – Same Interface, Different Behavior
### What is Polymorphism?
- "Many Forms": One interface (e.g., method name) with different implementations.
- Shared Behavior: Objects of different classes respond to the same method call (read()).
- Flexibility: Code works with any class adhering to the expected interface.
### Polymorphism Simplified (With a Real-Life Analogy)
- Imagine you have a universal remote control that works with any TV brand (Sony, Samsung, LG).
- Same Button ("Power") → Different TVs respond differently (but all turn ON/OFF).
- You don’t need to know how each TV works internally.
- This is polymorphism in action!
### Why is it Important in Data Engineering?
- Pipeline Abstraction: Process data from multiple sources (CSV, API, DB) uniformly.
- Extensibility: Add new data sources without modifying pipeline logic.
- Interchangeability: Swap readers (e.g., CSVReader → ParquetReader) seamlessly.



In [0]:
def run_reader(reader):  # Accepts any object with a read() method
    print(reader.read())  # Polymorphic call

run_reader(CSVReader())  # Output: "📄 Reading from CSV"
run_reader(APIReader())  # Output: "🌐 Fetching from API"
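
The interchangeability point can be sketched with two unrelated classes. `ParquetReader` and `run_readers` are hypothetical names introduced for this example, and the classes deliberately share no base class, to show that Python only needs the `read()` method to exist (duck typing).

```python
class CSVReader:
    def read(self):
        return "Reading from CSV"


class ParquetReader:  # Hypothetical new source, added later
    def read(self):
        return "Reading from Parquet"


def run_readers(readers):
    """Call read() on each object; the concrete class does not matter."""
    return [reader.read() for reader in readers]


outputs = run_readers([CSVReader(), ParquetReader()])
print(outputs)  # Output: ['Reading from CSV', 'Reading from Parquet']
```

Adding `ParquetReader` required no change to `run_readers`, which is the practical payoff of polymorphism in a pipeline.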

## Abstraction – Enforce Structure Across the Team
### What is Abstraction?
- Hides complex details, exposes only what’s necessary.
- Enforces structure (e.g., "All ingestors must have a read() method").
- Uses Abstract Base Classes (ABCs) to define rules.
### Real-Life Example: Driving a Car
#### 🚗 What you see (Simplified Interface):
- Steering wheel
- Accelerator
- Brake
#### 🔧 What’s hidden (Complex Internals):
- Engine combustion
- Gear mechanisms
- Fuel injection

✅ You don’t need to know how the engine works to drive!

→ This is abstraction in action.
### Why is it Important in Data Engineering?
- ✅ Team Standardization – Ensures everyone implements required methods.
- ✅ Prevents Mistakes – No incomplete classes (e.g., a DBIngestor without read()).
- ✅ Clean Contracts – "If it’s an ingestor, it must have these methods."

### Interview Answer (Simple & Powerful)
"Abstraction is like a rulebook for classes. The Ingestor ABC says: ‘If you’re an ingestor, you MUST have a read() method.’ This ensures all ingestors (CSV, API, DB) work the same way, making the code predictable and easy to extend."

In [0]:
# 1. Abstract Base Class (Blueprint)
from abc import ABC, abstractmethod

class Ingestor(ABC):  # Abstract class (cannot be instantiated)
    @abstractmethod
    def read(self):  # Must be implemented by child classes
        pass
# 2. Concrete Implementations
class CSVIngestor(Ingestor):
    def read(self):  # Must implement read() (otherwise instantiating the class raises TypeError)
        return "Reading CSV..."

class APIIngestor(Ingestor):
    def read(self):  # Must implement read()
        return "Calling API..."
# 3. Polymorphic Execution    
def run(ingestor):  # Works with ANY Ingestor subclass
    print(ingestor.read())

run(CSVIngestor())  # Output: "Reading CSV..."
run(APIIngestor())  # Output: "Calling API..."   
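
The "prevents mistakes" point can be checked directly. In this sketch, `DBIngestor` is a hypothetical subclass that forgets to implement `read()`, and `try_create` is a helper written only for this example. The exact wording of the error message differs between Python versions, so it is printed rather than assumed.

```python
from abc import ABC, abstractmethod


class Ingestor(ABC):
    @abstractmethod
    def read(self):
        """Return the ingested data."""


class DBIngestor(Ingestor):  # Hypothetical incomplete subclass: read() is missing
    pass


def try_create(cls):
    """Return an instance, or the TypeError message if the class is abstract."""
    try:
        return cls()
    except TypeError as exc:
        return f"TypeError: {exc}"


print(try_create(Ingestor))    # The abstract base itself cannot be instantiated
print(try_create(DBIngestor))  # Neither can a subclass that skips read()
```

The error is raised when the object is created, not when `read()` is eventually called, so an incomplete ingestor fails early.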

## Real Data Pipeline: OOP Version
(With Real-Life Analogy & Key Concepts)
### The Pipeline Structure (Like a Factory Assembly Line)
This pipeline has 3 specialized classes, each doing one job:

| Class | Real-Life Analogy | Responsibility | Method |
|---|---|---|---|
| FileIngestor | Raw Material Supplier | Fetches raw data | read() |
| Cleaner | Quality Control | Fixes errors, formats | clean() |
| Writer | Packaging & Shipping | Saves final product | write() |


In [0]:
# (1) FileIngestor - Gets Raw Data
class FileIngestor:
    def __init__(self, path):  # Constructor: Needs file path
        self.path = path        # Stores path as an attribute

    def read(self):            # Fetches data
        print(f"Reading data from {self.path}")
        return f"Raw data from {self.path}"  # Simulated output    
# Key Points:
# Encapsulation: path is stored internally.
# Single Responsibility: Only reads files, nothing else.

# (2) Cleaner - Processes Data
class Cleaner:
    def clean(self, data):     # Takes raw data, returns cleaned
        print("Cleaning data...")
        return f"Cleaned version of: {data}"  # Simulated cleaned output
# Key Points:
# Reusable: Can clean data from any source (CSV, API, etc.).
# Separation of Concerns: Doesn’t know/care where data came from.

# (3) Writer - Saves Final Output
class Writer:
    def write(self, data):     # Takes cleaned data
        print(f"Writing data: {data}")
        return "Write success"  # Confirmation
# Key Points:
# Flexible: Could write to DB, cloud, etc. (extend later).
# Loose Coupling: Doesn’t depend on FileIngestor or Cleaner.

# Pipeline Execution
# Assemble the pipeline
reader = FileIngestor("data.csv")  # Step 1: Configure reader
raw_data = reader.read()          # Step 2: Extract

cleaned = Cleaner().clean(raw_data)  # Step 3: Transform
Writer().write(cleaned)              # Step 4: Load
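
One possible next step is to wrap the three components in a small orchestrator so the whole flow runs with a single call. The `Pipeline` class below is a hypothetical addition, and the three component classes are simplified stand-ins for the ones above (no printing, and `write()` returns a descriptive string) so that the block runs on its own.

```python
class FileIngestor:
    def __init__(self, path):
        self.path = path

    def read(self):
        return f"Raw data from {self.path}"  # Simulated output


class Cleaner:
    def clean(self, data):
        return f"Cleaned version of: {data}"  # Simulated output


class Writer:
    def write(self, data):
        return f"Wrote: {data}"  # Simulated confirmation


class Pipeline:
    """Hypothetical orchestrator: receives its three components from the caller."""

    def __init__(self, ingestor, cleaner, writer):
        self.ingestor = ingestor
        self.cleaner = cleaner
        self.writer = writer

    def run(self):
        raw = self.ingestor.read()         # Extract
        cleaned = self.cleaner.clean(raw)  # Transform
        return self.writer.write(cleaned)  # Load


pipeline = Pipeline(FileIngestor("data.csv"), Cleaner(), Writer())
result = pipeline.run()
print(result)  # Output: Wrote: Cleaned version of: Raw data from data.csv
```

Passing the components in through the constructor keeps `Pipeline` loosely coupled: any object with a matching `read()`, `clean()`, or `write()` method can be swapped in.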