The following is from [this article](https://medium.com/towards-data-science/object-oriented-data-science-refactoring-code-5bcb4ae7ce72) on Medium.

For data scientists, code is the backbone of analysis and decision-making. As data science applications grow more intricate, from machine learning models embedded in software to complex data pipelines orchestrating vast amounts of information, developing clean, organized, and maintainable code becomes crucial. Object-oriented programming (OOP) unlocks flexibility and efficiencies that enable data scientists to respond to changing requirements with agility. OOP introduces the concept of classes, which serve as blueprints for creating objects that encapsulate both data and the operations that manipulate it. This paradigm shift allows data scientists to go beyond traditional functional approaches, promoting modular design and code reusability.

In this article, we’ll explore the benefits of refactoring data science code by creating classes and deploying object-oriented techniques, and how this approach can enhance modularity and reusability.

# The Power of Classes in Data Science

In traditional data science workflows, functions have been the standard approach for encapsulating logic. This is often sufficient, as functions allow developers to minimize repeated code. However, as projects evolve, maintaining an extensive collection of functions can lead to code that’s challenging to navigate, debug, and scale.

This is where classes come into play. A class is a blueprint for creating objects, which bundle both data and functions (called methods) that operate on that data. By organizing code into classes, developers can achieve several advantages:

1. Modularity and Encapsulation: Classes promote modularity by grouping related functionality together. Each class encapsulates its own attributes (data) and methods (functions), reducing the risk of global variable pollution and the potential for naming conflicts. This helps maintain a clear separation of concerns, making code easier to understand and modify.
2. Reusability: Classes encourage reusability by providing a consistent interface for similar tasks across different parts of the project. Once a class is defined, it can be instantiated whenever needed and its methods can be used to achieve consistent results.
3. Inheritance and Polymorphism: Inheritance allows developers to create subclasses that inherit attributes and methods from a parent class. This promotes code reuse while enabling customization for specific tasks. Polymorphism, another OOP concept, lets developers use the same method name across different classes, adapting behavior based on the specific implementation.
4. Testing and Debugging: Classes facilitate unit testing, as test cases can target individual methods within a class. This makes it easier to identify and fix issues and improves the overall robustness of the codebase.
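To make the inheritance and polymorphism points concrete, here is a minimal sketch. The class names (`Scaler`, `RangeScaler`, `ZScoreScaler`) and the sample column are hypothetical and chosen only for illustration; they are not taken from any library or from the article's repositories.

```python
import math


class Scaler:
    """Parent class: defines the shared fit/transform interface."""

    def fit(self, values):
        raise NotImplementedError

    def transform(self, values):
        raise NotImplementedError

    def fit_transform(self, values):
        # Written once here and inherited by every subclass
        self.fit(values)
        return self.transform(values)


class RangeScaler(Scaler):
    """Rescales values to the [0, 1] interval."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # guard against constant data
        return [(v - self.low) / span for v in values]


class ZScoreScaler(Scaler):
    """Rescales values to zero mean and unit (population) variance."""

    def fit(self, values):
        self.mean = sum(values) / len(values)
        variance = sum((v - self.mean) ** 2 for v in values) / len(values)
        self.std = math.sqrt(variance) or 1.0  # guard against constant data
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


# Polymorphism: the same method name is called on objects of different classes
column = [2.0, 4.0, 6.0]
results = {
    type(scaler).__name__: scaler.fit_transform(column)
    for scaler in (RangeScaler(), ZScoreScaler())
}
```

Because each method is small and attached to one class, a unit test can target `RangeScaler.transform` or `ZScoreScaler.fit` on its own, which is the testing benefit described in the list above.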

# Refactoring to Classes: A Theoretical Example

Let’s consider a scenario where you’re working on a machine learning project that involves data preprocessing, model training, and evaluation. Initially, you might have a collection of functions for each step:

In [1]:
# Example: Using functions for data preprocessing


def load_data(file_path):
    # Load raw data from file_path
    ...


def preprocess_data(data):
    # Clean, transform, and encode data
    ...


def train_model(preprocessed_data):
    # Train a machine learning model
    ...


def evaluate_model(trained_model, test_data):
    # Evaluate model performance
    ...

While functional decomposition works, the preprocessing, training, and evaluation stages each tend to accumulate many steps over time, and managing the growing set of functions becomes challenging.

Refactoring the code into classes:

In [2]:
class DataPreprocessor:
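    def __init__(self, file_path):
        self.data = self.load_data(file_path)
        self.preprocessed_data = self.preprocess_data()

    def preprocess_data(self):
        # Run the cleaning and transformation steps in order
        self.clean_data()
        return self.transform_data()

    def load_data(self, file_path):
        # Load raw data from file_path
        ...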

    def clean_data(self):
        # imputation, outlier treatment
        ...

    def transform_data(self):
        # transformations and encode data
        ...


class ModelTrainer:
    def __init__(self, preprocessed_data):
        self.model = self.fit(preprocessed_data)

    def fit(self, preprocessed_data):
        # Train a machine learning model and return it
        ...

    def predict(self, preprocessed_data):
        # Predict using the machine learning model
        ...


class ModelEvaluator:
    def __init__(self, trained_model, test_data):
        self.performance_metrics = self.evaluate_model(trained_model, test_data)

    def evaluate_model(self, trained_model, test_data):
        # Evaluate model performance
        ...

    def calculate_rmse(self, trained_model, test_data):
        # Evaluate root mean squared error
        ...

    def calculate_r_squared(self, trained_model, test_data):
        # Evaluate r_squared of the model
        ...

By breaking down the workflow into classes, the code gains organization, and the structure is easier to read and maintain. Each class handles a specific aspect of the process, and each can be instantiated with the output of the previous step.
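As a rough illustration of how the three classes might be chained together, the sketch below fills in toy method bodies so that it runs on its own. Several details are assumptions made for the example and are not in the original skeleton: the preprocessor takes in-memory rows instead of a file path, the "model" is a one-feature least-squares line stored as a dictionary, and the evaluator receives the trainer object so it can call `predict`.

```python
import math


class DataPreprocessor:
    def __init__(self, rows):
        self.data = list(rows)
        self.preprocessed_data = self.preprocess_data()

    def preprocess_data(self):
        # Stand-in for cleaning: drop rows with a missing target
        return [(x, y) for x, y in self.data if y is not None]


class ModelTrainer:
    def __init__(self, preprocessed_data):
        self.model = self.fit(preprocessed_data)

    def fit(self, preprocessed_data):
        # Ordinary least squares for one feature: y = slope * x + intercept
        xs = [x for x, _ in preprocessed_data]
        ys = [y for _, y in preprocessed_data]
        x_mean, y_mean = sum(xs) / len(xs), sum(ys) / len(ys)
        covariance = sum((x - x_mean) * (y - y_mean) for x, y in preprocessed_data)
        x_variance = sum((x - x_mean) ** 2 for x in xs)
        slope = covariance / x_variance
        return {"slope": slope, "intercept": y_mean - slope * x_mean}

    def predict(self, xs):
        return [self.model["slope"] * x + self.model["intercept"] for x in xs]


class ModelEvaluator:
    def __init__(self, trainer, test_data):
        self.performance_metrics = self.evaluate_model(trainer, test_data)

    def evaluate_model(self, trainer, test_data):
        return {"rmse": self.calculate_rmse(trainer, test_data)}

    def calculate_rmse(self, trainer, test_data):
        predictions = trainer.predict([x for x, _ in test_data])
        squared_errors = [(p - y) ** 2 for p, (_, y) in zip(predictions, test_data)]
        return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data lying on the line y = 2x + 1, with one missing target
raw_rows = [(1.0, 3.0), (2.0, 5.0), (3.0, None), (4.0, 9.0)]
test_rows = [(5.0, 11.0), (6.0, 13.0)]

# Each object is built from the output of the previous step
preprocessor = DataPreprocessor(raw_rows)
trainer = ModelTrainer(preprocessor.preprocessed_data)
evaluator = ModelEvaluator(trainer, test_rows)
print(evaluator.performance_metrics)
```

Each stage exposes its result as an attribute (`preprocessed_data`, `model`, `performance_metrics`), so an intermediate result can be inspected or reused without rerunning the earlier stages.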

In this case, incorporating classes provides an extra layer of structure and flexibility that improves the workflow and usability of the code. By embracing the power of classes, this example creates a more robust and scalable code base.

# Refactoring to Classes: A Practical Example

As a practical example, I recently refactored code that was developed in [this repository](https://github.com/mollyryanruby/sales_forecasting) 3 years ago into [a new repository](https://github.com/mollyryanruby/auto_forecast) to show the difference in the code before and after refactoring.

In the initial repository, many functions encompassed modeling tasks as several different models were trained and tested. In the refactored version, there is a model class SalesForecasting that encompasses all the modeling tasks. This is easier to read and allows the package to be deployed more efficiently, as SalesForecasting can be instantiated multiple times with different inputs. As a preview, the class looks like this:

In [3]:
class SalesForecasting:
    """
    SalesForecasting class to train and predict sales using a variety of models.
    """

    def __init__(self, model_list):
        """
        Initialize the SalesForecasting class with a list of models to train and predict.

        Args:
            model_list (list): list of models to train and predict. Options include:
                - LinearRegression
                - RandomForest
                - XGBoost
                - LSTM
                - ARIMA

        Returns:
            None
        """

        ...

    def fit(self, X_train, y_train):
        """
        Fit the models in model_dict to the training data.

        Args:
            X_train (pd.DataFrame): training data exogenous features for the model
            y_train (pd.Series): training data target for the model

        Returns:
            None
        """

        ...

    def __fit_regression_model(self, model):
        """
        Fit a regression model to the training data.

        Args:
            model (sklearn model): sklearn model to fit to the training data

        Returns:
            model (sklearn model): fitted sklearn model
        """
        ...

    def __fit_lstm_model(self, model):
        """
        Fit an LSTM model to the training data.

        Args:
            model (keras model): keras model to fit to the training data

        Returns:
            model (keras model): fitted keras model
        """

        ...

    def __fit_arima_model(self, model_name):
        """
        Fit an ARIMA model to the training data.

        Args:
            model_name (str): name of the model to fit to the training data

        Returns:
            model (pmdarima model): fitted pmdarima model
        """
        ...

    def predict(self, x_values, y_values=None, scaler=None, print_scores=False):
        """
        Predict values using the models in model_dict.

        Args:
            x_values (pd.DataFrame): exogenous features to predict on
            y_values (pd.Series): target values to compare predictions against
            scaler (sklearn scaler): scaler used to scale the data
            print_scores (bool): whether to print the scores for each model

        Returns:
            self (SalesForecasting): self with updated predictions
        """

        ...

    def __predict_regression_model(self, model):
        """
        Predict values using a regression model.

        Args:
            model (sklearn model): sklearn model to predict with

        Returns:
            predictions (np.array): array of predictions
        """
        ...

    def __predict_lstm_model(self, model):
        """
        Predict values using an LSTM model.

        Args:
            model (keras model): keras model to predict with

        Returns:
            predictions (np.array): array of predictions
        """
        ...

    def __predict_arima_model(self, model):
        """
        Predict values using an ARIMA model.

        Args:
            model (pmdarima model): pmdarima model to predict with

        Returns:
            predictions (np.array): array of predictions
        """
        ...

    def __undo_scaling(self, values, scaler):
        """
        Undo scaling on a set of values.

        Args:
            values (np.array): array of values to unscale
            scaler (sklearn scaler): scaler to use to unscale the values

        Returns:
            unscaled_values (np.array): array of unscaled values
        """
        ...

    def get_scores(self, y_pred, y_true, model_name=None, print_scores=False):
        """
        Get the scores for a model. Scores include RMSE, MAE, and R2.

        Args:
            y_pred (np.array): array of predicted values
            y_true (np.array): array of true values
            model_name (str): name of the model to get scores for
            print_scores (bool): whether to print the scores for the model

        Returns:
            rmse (float): root mean squared error
            mae (float): mean absolute error
            r2 (float): r squared
        """
        ...

    def plot_results(
        self,
        model_list=None,
        figsize=(13, 3),
        xlabel="Date",
        ylabel="Sales",
        title="Sales Forecasting Predictions",
    ):
        """
        Plot the results of the predictions against the actual values.
        Generates a timeseries for predictions from each model in model_dict.

        Args:
            model_list (list): list of models to plot. If None, plots all models in model_dict
            figsize (tuple): tuple of figure size
            xlabel (str): label for x axis
            ylabel (str): label for y axis
            title (str): title for the plot

        Returns:
            fig (matplotlib figure): figure with the plot
        """

        ...

    def plot_errs(self, figsize=(13, 3)):
        """
        Plot the errors for each model in model_dict. Errors include RMSE, MAE, and R2.

        Args:
            figsize (tuple): tuple of figure size

        Returns:
            fig (matplotlib figure): figure with the plot
        """
        ...

The class “SalesForecasting” serves as a comprehensive blueprint for data-driven businesses to anticipate future sales trends through the application of various predictive models. Within this class, data scientists can harness the power of different modeling techniques, including Linear Regression, Random Forest, XGBoost, LSTM (Long Short-Term Memory), and ARIMA (AutoRegressive Integrated Moving Average). By encapsulating the forecasting workflow within this class, the process of model fitting, prediction, and evaluation becomes streamlined and consistent across different model types. Through the “SalesForecasting” class, data scientists can efficiently experiment with different algorithms and easily maintain the code base.
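The idea of one class giving a consistent fit, predict, and score interface across several model types can be sketched in a few lines. This is not the implementation from the repository: `MultiModelForecaster`, `MeanModel`, and `LastValueModel` are hypothetical names, the two toy models stand in for real estimators, and the scoring covers only RMSE and MAE, leaving out R2.

```python
import math


class MeanModel:
    """Toy model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class LastValueModel:
    """Toy model: always predicts the last training target."""

    def fit(self, X, y):
        self.last_ = y[-1]
        return self

    def predict(self, X):
        return [self.last_ for _ in X]


class MultiModelForecaster:
    """Hypothetical, simplified stand-in for the pattern described above."""

    available_models = {"Mean": MeanModel, "LastValue": LastValueModel}

    def __init__(self, model_list):
        # One untrained model instance per requested name
        self.model_dict = {name: self.available_models[name]() for name in model_list}
        self.predictions = {}
        self.scores = {}

    def fit(self, X_train, y_train):
        for model in self.model_dict.values():
            model.fit(X_train, y_train)
        return self

    def predict(self, x_values, y_values=None):
        for name, model in self.model_dict.items():
            self.predictions[name] = model.predict(x_values)
            if y_values is not None:
                self.scores[name] = self.get_scores(self.predictions[name], y_values)
        return self

    def get_scores(self, y_pred, y_true):
        errors = [p - t for p, t in zip(y_pred, y_true)]
        rmse = math.sqrt(sum(e ** 2 for e in errors) / len(errors))
        mae = sum(abs(e) for e in errors) / len(errors)
        return {"rmse": rmse, "mae": mae}


# The same three calls work regardless of which models are requested
forecaster = MultiModelForecaster(["Mean", "LastValue"])
forecaster.fit(X_train=[[1], [2], [3]], y_train=[10.0, 20.0, 30.0])
forecaster.predict(x_values=[[4], [5]], y_values=[30.0, 40.0])
```

Adding another model type in this sketch means registering one more class in `available_models`; the calling code that fits, predicts, and compares scores does not change.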

Object-oriented programming is a tool for data scientists to architect code that mirrors the intricacies of the real-world systems they analyze, enabling them to extract valuable insights while maximizing agility. Although Python intends for classes to be used for instantiation and inheritance, the example above shows a first step in which classes are leveraged for modularizing code. As data science capabilities expand and teams grow, maintaining efficient code is essential.