<a href="https://colab.research.google.com/github/micah-shull/pipelines/blob/main/pipelines_06_pytorch_pipeline_02_data_prep.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Data Preparation Notebook

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00350/default%20of%20credit%20card%20clients.xls"
df = pd.read_excel(url, header=1)

# Rename columns to lower case and replace spaces with underscores
df.columns = [col.lower().replace(' ', '_') for col in df.columns]

# Convert specific numeric columns to categorical
categorical_columns = ['sex', 'education', 'marriage']
df[categorical_columns] = df[categorical_columns].astype('category')

# Select features and target; drop the row identifier, which is not a predictive feature
target = 'default_payment_next_month'
X = df.drop(columns=[target, 'id'])
y = df[target]

# Perform stratified train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Identify column types
numeric_features = X_train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object', 'category']).columns.tolist()

# Define preprocessing for numeric columns
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

# Define preprocessing for categorical columns
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

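# Combine preprocessing steps; sparse_threshold=0 forces a dense array,
# which np.savez can store without pickling
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)],
    sparse_threshold=0)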

# Fit and transform the data
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Save preprocessed data (labels converted from pandas Series to NumPy arrays)
np.savez(
    'preprocessed_data.npz',
    X_train_processed=X_train_processed,
    X_test_processed=X_test_processed,
    y_train=y_train.to_numpy(),
    y_test=y_test.to_numpy())

print("Data preparation complete and saved.")


Data preparation complete and saved.


### Data Preparation Notebook Summary

#### Purpose:
The Data Preparation Notebook handles the first steps of the machine learning workflow. It loads the dataset, standardizes column names and types, splits the data into training and test sets, and preprocesses the features, so that the data is ready for analysis and modeling in separate, specialized notebooks.

#### What It Does:
1. **Load the Dataset**: Reads the UCI "default of credit card clients" Excel file directly from its URL, taking column names from the second row (`header=1`).
2. **Rename Columns**: Standardizes column names by converting them to lowercase and replacing spaces with underscores.
3. **Convert Data Types**: Converts `sex`, `education`, and `marriage` from integer codes to the pandas `category` dtype, so that they are one-hot encoded rather than scaled as numbers.
4. **Select Features and Target**: Separates the target column, `default_payment_next_month`, from the remaining columns, which serve as features.
5. **Train-Test Split**: Performs a stratified 80/20 split (`test_size=0.2`, `random_state=42`, `stratify=y`) so that the class distribution is the same in both subsets.
6. **Preprocessing Pipelines**:
   - **Numeric Features**: Imputes missing values with the median, then standardizes each feature with `StandardScaler`.
   - **Categorical Features**: Imputes missing values with the most frequent category, then applies one-hot encoding; categories not seen during fitting are ignored (`handle_unknown='ignore'`).
   - **Column Transformer**: Combines the numeric and categorical pipelines into a single `ColumnTransformer`.
7. **Transform the Data**: Fits the preprocessor on the training data only, then applies it to both the training and test data, so that no test-set statistics leak into the imputation or scaling.
8. **Save Preprocessed Data**: Saves the transformed feature arrays and the labels to a single file, `preprocessed_data.npz`, for use in subsequent notebooks.
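
Steps 5 to 7 can be tried in isolation on a small synthetic frame. The following is a minimal sketch, not this notebook's code: the toy columns `amount`, `group`, and `label` are hypothetical stand-ins for the credit card data, and `sparse_threshold=0` is an added assumption that forces a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: 80 negatives and 20 positives (not the real dataset).
toy = pd.DataFrame({
    "amount": np.arange(100, dtype="float64"),
    "group": ["a", "b"] * 50,
    "label": [0] * 80 + [1] * 20,
})
X, y = toy.drop(columns=["label"]), toy["label"]

# Stratified 80/20 split: both subsets keep the 20% positive rate.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print("positive rate (train, test):", y_tr.mean(), y_te.mean())

# Same pipeline structure as the notebook, on one numeric and one categorical column.
numeric = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())])
categorical = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric, ["amount"]),
        ("cat", categorical, ["group"])],
    sparse_threshold=0)  # assumption: always return a dense array

# Fit on the training split only, then reuse the fitted statistics on the test split.
X_tr_p = preprocessor.fit_transform(X_tr)
X_te_p = preprocessor.transform(X_te)
print(X_tr_p.shape, X_te_p.shape)
```

Calling `transform` (not `fit_transform`) on the test split is what keeps the test data out of the imputation and scaling statistics.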

#### Why a Modular Approach is Preferable:
1. **Improved Readability**: A small notebook focused on one task is easier to read, so collaborators can quickly see what the code does and why.
2. **Ease of Maintenance**: Each stage can be updated and debugged on its own. As long as the saved outputs keep the same format, a change in one notebook does not force changes in the others.
3. **Enhanced Reusability**: Separate notebooks for data preparation, feature selection, and model training can be reused in other projects with similar tasks.
4. **Collaboration**: Team members can work on different notebooks at the same time without editing the same file.
5. **Focused Analysis**: Because each notebook has a single purpose, each step of the machine learning pipeline can be analyzed and documented in depth.
6. **Scalability**: As the project grows, new methods or analyses go into new notebooks rather than complicating the existing ones.
7. **Professionalism**: Separating concerns in this way follows established practice in software development and data science.

By handling data preparation in its own notebook, this workflow gives the specialized notebooks that follow a consistent, ready-to-use dataset to build on.
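
A downstream notebook might begin by reloading the saved arrays. The sketch below assumes the key names passed to `np.savez` in this notebook; so that it runs on its own, it first writes a small stand-in file with made-up values to a temporary directory, which the real workflow would not need.

```python
import os
import tempfile

import numpy as np

# Stand-in file with made-up values so the sketch is self-contained; in the
# real workflow this is the preprocessed_data.npz written by this notebook.
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "preprocessed_data.npz")
np.savez(
    path,
    X_train_processed=rng.normal(size=(8, 3)),
    X_test_processed=rng.normal(size=(2, 3)),
    y_train=np.array([0, 1, 0, 0, 1, 0, 0, 1]),
    y_test=np.array([0, 1]),
)

# Load each array by the keyword name it was saved under.
with np.load(path) as data:
    X_train = data["X_train_processed"]
    X_test = data["X_test_processed"]
    y_train = data["y_train"]
    y_test = data["y_test"]

# Basic consistency checks before handing the arrays to a model.
assert X_train.shape[0] == y_train.shape[0]
assert X_test.shape[0] == y_test.shape[0]
assert X_train.shape[1] == X_test.shape[1]
print(X_train.shape, X_test.shape)
```

Checking shapes at load time catches a mismatch between notebooks early, before any model code runs.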