# Handling Exceptions

In this notebook, we'll explore how to handle exceptions effectively. Exception handling is crucial for building robust and maintainable code, especially in complex workflows. We'll cover best practices, demonstrate how to implement them in a data science context, and illustrate advanced techniques such as using custom exceptions and ensuring clean error handling across nested functions.

## Table of Contents

1. [Basic Exception Handling](#1)
2. [Custom Exceptions](#2)
3. [Nested Functions and Exception Propagation](#3)
4. [Logging Exceptions](#4)
5. [Step-by-Step Example](#5)
6. [Exercise](#6)

---
## 1. Basic Exception Handling <a name="1"></a>

Exception handling allows your code to deal with errors gracefully. Here's a simple example of handling an exception in a data loading step.

In [None]:
import pandas as pd

def load_data(filepath):
    try:
        data = pd.read_csv(filepath)
        return data
    except FileNotFoundError:
        print(f"Error: The file at {filepath} was not found.")
    except pd.errors.EmptyDataError:
        print("Error: No data in file.")
    except Exception as e:
        print(f"An unexpected error occurred: {e}")

# Usage
data = load_data('data/raw/non_existent_file.csv')
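
The handlers above print a message and then fall through, so `load_data` returns `None` when loading fails. As a minimal sketch of the full `try`/`except`/`else`/`finally` flow, the example below uses the standard-library `csv` module instead of pandas so that it stays self-contained. `read_rows` is a hypothetical helper and is not part of this notebook.

```python
import csv
import os
import tempfile


def read_rows(filepath):
    """Return the rows of a CSV file as a list of dicts, or None if the file is missing."""
    try:
        handle = open(filepath, newline="")
    except FileNotFoundError:
        print(f"Error: The file at {filepath} was not found.")
        return None
    else:
        # Runs only when open() succeeded.
        with handle:
            return list(csv.DictReader(handle))
    finally:
        # Runs on every path, including the early return above.
        print(f"Finished attempt to read {filepath}")


# Usage: write a tiny temporary file, read it back, then try a missing path.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    with open(path, "w", newline="") as f:
        f.write("a,b\n1,2\n")
    rows = read_rows(path)
    missing = read_rows(os.path.join(tmp, "missing.csv"))
```

Returning `None` explicitly makes the failure path visible to the caller, who then has to check for it before using the result.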

---
## 2. Custom Exceptions <a name="2"></a>

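Custom exceptions give each failure mode its own type, so callers can catch exactly the errors they know how to handle and can read extra context (such as the number of missing values) from the exception object.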

In [None]:
class DataPipelineError(Exception):
    pass

class DataValidationError(DataPipelineError):
    def __init__(self):
        super().__init__("Data validation failed.")

class MissingValuesError(DataPipelineError):
    def __init__(self, missing_values):
        self.missing_values = missing_values
        super().__init__(
            f"Data contains {missing_values} missing values.")

# This function always raises an exception
def validate_data(data):
    missing_values = data.isnull().sum().sum()
    if missing_values > 0:
        raise MissingValuesError(missing_values)
    else:
        raise DataValidationError

# Usage
try:
    data = load_data('train.csv')
    validate_data(data)
except MissingValuesError as e:
    print(f"Data contains {e.missing_values} missing values.")
except DataValidationError as e:
    print(f"Validation Error: {e}")
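
Because the custom exceptions share a base class, a caller that does not care which validation step failed can catch `DataPipelineError` alone. The sketch below illustrates this under stated assumptions: `check_rows` is a hypothetical validator that works on plain dictionaries rather than a DataFrame, and the two classes are redefined only to keep the example self-contained.

```python
class DataPipelineError(Exception):
    """Base class for all pipeline errors (mirrors the class defined above)."""


class MissingValuesError(DataPipelineError):
    def __init__(self, missing_values):
        self.missing_values = missing_values
        super().__init__(f"Data contains {missing_values} missing values.")


def check_rows(rows):
    """Hypothetical validator: counts None entries in a list of dicts."""
    missing = sum(value is None for row in rows for value in row.values())
    if missing > 0:
        raise MissingValuesError(missing)


caught = None
try:
    check_rows([{"x": 1, "y": None}, {"x": None, "y": 2}])
except DataPipelineError as e:
    # The base-class handler also catches MissingValuesError.
    caught = e

print(type(caught).__name__, caught.missing_values)
```

The caught object is still a `MissingValuesError`, so its `missing_values` attribute remains available inside the base-class handler.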

---
## 3. Nested Functions and Exception Propagation <a name="3"></a>

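When functions call one another, an exception raised in an inner function propagates up the call stack until a handler catches it. Inner functions can translate low-level errors into pipeline-specific ones with `raise ... from`, and the outer function can log the failure and re-raise it.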


In [None]:
class DataPreprocessingError(DataPipelineError):
    def __init__(self):
        super().__init__("Missing column during preprocessing")

def preprocess_data(data):
    try:
        # Example preprocessing step
        data['new_column'] = data['existing_column'] * 2
        return data
    except KeyError as e:
        raise DataPreprocessingError from e

def run_pipeline(filepath):
    try:
        data = load_data(filepath)
        validate_data(data)
        data = preprocess_data(data)
        return data
    except DataPreprocessingError as e:
        print(f"Data preprocessing failed: {e}")
        raise
    except Exception as e:
        print(f"An unexpected error occurred in the pipeline: {e}")
        raise

# Usage
processed_data = run_pipeline('train.csv')
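
The `raise ... from e` form used in `preprocess_data` keeps the original error attached to the new one as `__cause__`. The following is a minimal sketch of how a caller could inspect it; `double_column` is a hypothetical stand-in that operates on a plain dictionary, and the exception class is redefined here only so the example runs on its own.

```python
class DataPreprocessingError(Exception):
    """Stand-in for the pipeline-specific error defined above."""


def double_column(row):
    try:
        return row["existing_column"] * 2
    except KeyError as e:
        # Chain the low-level KeyError to the pipeline-specific error.
        raise DataPreprocessingError("Missing column during preprocessing") from e


cause = None
try:
    double_column({"other_column": 3})
except DataPreprocessingError as e:
    # The original KeyError is preserved on the new exception.
    cause = e.__cause__

print(repr(cause))
```

Both exceptions also appear in the traceback when the chained error goes unhandled, which makes the root cause easier to find.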

---
## 4. Logging Exceptions <a name="4"></a>

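Logging gives you more control than `print`, especially in production environments: messages carry a severity level, can be routed to files or monitoring systems, and `logger.exception` records the full traceback alongside the message.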

In [None]:
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_data(filepath):
    try:
        data = pd.read_csv(filepath)
        return data
    except FileNotFoundError:
        logger.exception(f"File not found: {filepath}")
        raise
    except pd.errors.EmptyDataError:
        logger.exception("No data in file.")
        raise
    except Exception as e:
        logger.exception(f"An unexpected error occurred: {e}")
        raise

# Usage
try:
    data = load_data('data/raw/non_existent_file.csv')
except Exception as e:
    logger.critical(f"Critical error occurred: {e}")
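
To see what `logger.exception` adds over `logger.error`, the sketch below writes both calls to an in-memory stream and compares the output. The logger name `exception_demo` and the `StringIO` handler are assumptions made for this illustration; they are not part of the notebook's configuration.

```python
import io
import logging

# Capture log output in memory instead of printing it.
stream = io.StringIO()
demo_logger = logging.getLogger("exception_demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False
demo_logger.addHandler(logging.StreamHandler(stream))

try:
    int("not a number")
except ValueError:
    # error() records only the message.
    demo_logger.error("Logged with error()")
    error_output = stream.getvalue()
    # exception() records the message plus the active traceback.
    demo_logger.exception("Logged with exception()")
    exception_output = stream.getvalue()[len(error_output):]

print(exception_output)
```

`logger.exception` is meant to be called from inside an `except` block, where there is an active exception whose traceback it can record.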

---
## 5. Step-by-Step Example <a name="5"></a>

We'll now build a complete data science pipeline with exception handling at each step.

We start with this code:

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

def load_data(filepath):
    data = pd.read_csv(filepath)
    return data
    
def preprocess_data(data):
    X = data.drop(columns=['target'])
    y = data['target']
    return train_test_split(X, y, test_size=0.2, random_state=42)

def train_model(X_train, y_train):
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model

def pipeline(datapath='train.csv'):
    # Step 1: Load the data
    data = load_data(datapath)

    # Step 2: Preprocess the data
    X_train, X_test, y_train, y_test = preprocess_data(data)
    
    # Step 3: Train the model
    model = train_model(X_train, y_train)

    # Step 4: Evaluate the model
    predictions = model.predict(X_test)
    mse = mean_squared_error(y_test, predictions)
    print(f"Model Mean Squared Error: {mse}")

**Step 1: Set up the logger and create a custom parent exception**

In [14]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Create parent exception class
class ModelPipelineError(Exception):
    pass


**Step 2: Data Loading**


In [15]:
class FileMissingException(ModelPipelineError):
    def __init__(self, filepath):
        self.filepath = filepath
        super().__init__(
            f"The file {self.filepath} couldn't be found.")
    
def load_data(filepath):
    try:
        data = pd.read_csv(filepath)
        return data
    except FileNotFoundError as e:
        raise FileMissingException(filepath) from e

**Step 3: Data Preprocessing**

In [17]:
class TargetMissingException(ModelPipelineError):
    def __init__(self):
        super().__init__("The target is missing in the dataset.")

def preprocess_data(data):
    try:
        if 'target' not in data.columns:
            raise TargetMissingException
        data = data.dropna()  # Handle missing values
        X = data.drop(columns=['target'])
        y = data['target']
        return train_test_split(X, y, test_size=0.2, random_state=42)
    except KeyError as e:
        raise TargetMissingException from e

**Step 4: Model Training**

In [None]:
class ModelTrainingException(ModelPipelineError):
    def __init__(self):
        super().__init__("The model was provided with invalid values.")

def train_model(X_train, y_train):
    try:
        model = LinearRegression()
        model.fit(X_train, y_train)
        return model
    except ValueError as e:
        raise ModelTrainingException from e

**Step 5: Evaluating the model**

In [None]:
class ModelEvaluationException(ModelPipelineError):
    def __init__(self):
        super().__init__("The evaluation of the model failed.")

def evaluate_model(model, X_test, y_test):
    try:
        # Predictions are computed here, so they are not a parameter.
        predictions = model.predict(X_test)
        mse = mean_squared_error(y_test, predictions)
        print(f"Model Mean Squared Error: {mse}")
    except Exception as e:
        raise ModelEvaluationException from e

**Step 6: Running the Pipeline**

In [None]:
def pipeline(datapath):
    try:
        # Step 1: Load the data
        data = load_data(datapath)

        # Step 2: Preprocess the data
        X_train, X_test, y_train, y_test = preprocess_data(data)

        # Step 3: Train the model
        model = train_model(X_train, y_train)

        # Step 4: Evaluate the model (raises ModelEvaluationException on failure)
        evaluate_model(model, X_test, y_test)

    except FileMissingException as e:
        logger.exception("A problem was found while loading the data.")
        raise
    except TargetMissingException as e:
        logger.exception("A problem was found while preparing the data.")
        raise
    except ModelTrainingException as e:
        logger.exception("A problem was found while training the model.")
        raise
    except ModelEvaluationException as e:
        logger.exception("Error during model evaluation.")
        raise
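
Since every custom exception in this example derives from `ModelPipelineError`, a caller that does not need step-specific messages could catch the parent class once. The sketch below shows that variant under stated assumptions: `load_text` and `run` are hypothetical stand-ins for `load_data` and `pipeline`, and the logger name is made up for the illustration.

```python
import logging

logger = logging.getLogger("pipeline_demo")  # hypothetical logger name


class ModelPipelineError(Exception):
    pass


class FileMissingException(ModelPipelineError):
    def __init__(self, filepath):
        self.filepath = filepath
        super().__init__(f"The file {filepath} couldn't be found.")


def load_text(filepath):
    """Hypothetical loader standing in for load_data."""
    try:
        with open(filepath) as f:
            return f.read()
    except FileNotFoundError as e:
        raise FileMissingException(filepath) from e


def run(filepath):
    """Return True on success and False if any pipeline step failed."""
    try:
        load_text(filepath)
    except ModelPipelineError:
        # One handler covers every subclass of the parent exception.
        logger.exception("The pipeline failed.")
        return False
    return True


succeeded = run("definitely_missing_file.csv")
```

The trade-off is that a single handler logs one generic message, whereas separate handlers can describe which step failed.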

---
## 6. Exercise <a name="6"></a>
**Task**
You are provided with a simple data science pipeline that loads data, validates it, preprocesses it, and trains a model. The pipeline currently does not have any exception handling. Your task is to:
 - Add exception handling to each step of the pipeline.
 - Use custom exceptions where appropriate.
 - Implement logging for all exceptions.

**Initial Code**


In [None]:

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def load_data(filepath):
    data = pd.read_csv(filepath)
    return data

def validate_data(data):
    if data.isnull().sum().sum() > 0:
        print("Data contains missing values.")

def preprocess_data(data):
    data['new_column'] = data['existing_column'] * 2
    return data

def train_model(data):
    X = data[['new_column']]
    y = data['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model

def run_pipeline(filepath):
    data = load_data(filepath)
    validate_data(data)
    data = preprocess_data(data)
    model = train_model(data)
    return model

# Usage
model = run_pipeline('data/raw/example.csv')
print(model)


**Requirements**
- Handle file not found errors in load_data.
- Raise a custom exception for validation errors in validate_data.
- Handle missing column errors in preprocess_data.
- Handle any errors during model training in train_model.
- Log all exceptions with appropriate severity levels.

Write your solution here:

In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def load_data(filepath):
    data = pd.read_csv(filepath)
    return data

def validate_data(data):
    if data.isnull().sum().sum() > 0:
        print("Data contains missing values.")

def preprocess_data(data):
    data['new_column'] = data['existing_column'] * 2
    return data

def train_model(data):
    X = data[['new_column']]
    y = data['target']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    return model

def run_pipeline(filepath):
    data = load_data(filepath)
    validate_data(data)
    data = preprocess_data(data)
    model = train_model(data)
    return model

# Usage
model = run_pipeline('data/raw/example.csv')
print(model)


**Solution**

(Careful: this solution is not yet correct and is still being reviewed.)

In [None]:
import pandas as pd
import logging
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class DataValidationError(Exception):
    pass

def load_data(filepath):
    try:
        data = pd.read_csv(filepath)
        return data
    except FileNotFoundError:
        logger.error(f"File not found: {filepath}")
        raise
    except pd.errors.EmptyDataError:
        logger.error("No data in file.")
        raise
    except Exception as e:
        logger.error(f"An unexpected error occurred: {e}")
        raise

def validate_data(data):
    try:
        if data.isnull().sum().sum() > 0:
            raise DataValidationError("Data contains missing values.")
    except DataValidationError as e:
        logger.warning(f"Validation error: {e}")
        raise

def preprocess_data(data):
    try:
        data['new_column'] = data['existing_column'] * 2
        return data
    except KeyError as e:
        logger.error(f"Missing column during preprocessing: {e}")
        # Chain the original KeyError so the root cause is preserved
        raise DataValidationError(f"Preprocessing error: {e}") from e

def train_model(data):
    try:
        X = data[['new_column']]
        y = data['target']
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
        model = LinearRegression()
        model.fit(X_train, y_train)
        return model
    except KeyError as e:
        logger.error(f"Missing target column: {e}")
        # Chain the original KeyError so the root cause is preserved
        raise DataValidationError(f"Training error: {e}") from e
    except Exception as e:
        logger.error(f"An error occurred during model training: {e}")
        raise

def run_pipeline(filepath):
    try:
        data = load_data(filepath)
        validate_data(data)
        data = preprocess_data(data)
        model = train_model(data)
        return model
    except DataValidationError as e:
        logger.error(f"Pipeline failed: {e}")
        raise  # Re-raise so the caller sees the failure instead of None
    except Exception as e:
        logger.critical(f"Critical error in pipeline: {e}")
        raise

# Usage
try:
    model = run_pipeline('data/raw/example.csv')
    print(model)
except Exception as e:
    logger.critical(f"Pipeline execution failed: {e}")