# Data Preparation for Machine Learning

This notebook guides you through the steps required to prepare a dataset for machine learning, so that the data is clean, well structured, and ready for modeling. We will cover the following steps:

1. Importing necessary packages
2. Loading the dataset
3. Exploring and profiling the data
4. Cleaning and preprocessing the data
    - Removing redundant columns
    - Handling duplicate rows
    - Handling missing values
    - Encoding categorical variables
    - Feature scaling, dimensionality reduction, and splitting the dataset
5. Exporting the transformed dataset



---

## Importing Necessary Packages

In this step, we will import the Python packages used throughout the notebook. These include libraries for data manipulation, visualization, statistics, and interactive widgets, as well as the Keboola component library.



In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
import os
import matplotlib
import sys
import site
import logging

%matplotlib inline
sys.path.append(site.getusersitepackages())

from scipy.stats import norm
from scipy import stats
from ipywidgets import widgets, Layout
from keboola.component import CommonInterface

warnings.filterwarnings('ignore')

---
## Selecting a Dataset

In this step, we will list the tables that you have loaded into the workspace using the table input mapping. You can then select the dataset you want to use for data profiling and exploration.

The input datasets are loaded using the Keboola Common Interface, which allows seamless interaction with the data tables defined in the workspace.


In [None]:
# Initialize CommonInterface
ci = CommonInterface()

# Load input tables
input_tables = ci.get_input_tables_definitions()

# List all CSV files in the input tables directory
table_list = []
for table in input_tables:
    table_list.append(table.full_path)

# Create a dropdown widget for selecting a table
if table_list:
    print("Select the dataset you want to use from the dropdown.")
    tables = widgets.Dropdown(options=table_list, value=table_list[0],
                              description='Table:', disabled=False)
    display(tables)
else:
    logging.warning("No tables found. Please ensure you have loaded tables into the workspace using the table input mapping.")


### Load Selected Dataset
Once you have selected a dataset from the dropdown, this cell reads the CSV file into a pandas DataFrame and generates a profile report using the `ydata-profiling` package.


In [None]:
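from ydata_profiling import ProfileReport

# Read the selected CSV file and build a profile report from it
data = pd.read_csv(tables.value)
profile = ProfileReport(data)
display(data.head())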

## Alternatively, Load a Dataset from a URL to Follow the Example

In [None]:
# URL of the Titanic dataset
titanic_url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

# Load the Titanic dataset into a pandas DataFrame
data = pd.read_csv(titanic_url)

display(data.head())

---
## Removing Redundant Columns

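We will now identify and remove columns that have at most one unique value, because a column that never varies gives a model nothing to learn from. Missing values are not counted as a value here, so completely empty columns are flagged as well. You will be asked to confirm before any columns are removed.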
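
As a minimal sketch of how this check behaves, the toy DataFrame below (hypothetical data, not taken from the workspace) shows that `nunique()` ignores missing values by default, so an entirely empty column is flagged together with a constant one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data used only for illustration
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "country": ["CZ", "CZ", "CZ"],        # constant column
    "notes": [np.nan, np.nan, np.nan],    # entirely missing column
})

# nunique() skips NaN by default, so "notes" has 0 unique values
unique_counts = toy.nunique()
flagged = [col for col in toy.columns if unique_counts[col] <= 1]
print(flagged)
```

A column holding a single value plus some missing cells is flagged too, so it is worth reviewing the list before confirming the drop.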


In [None]:
# Identify columns with at most one unique value (nunique() ignores NaN)
redundant_columns = [col for col in data.columns if data[col].nunique() <= 1]

# Display redundant columns and ask for confirmation to drop
if redundant_columns:
    print(f"The following columns have only one unique value and can be considered redundant: {redundant_columns}")
    drop_redundant = widgets.ToggleButtons(
        options=['Yes', 'No'],
        description='Drop Columns?',
        disabled=False,
        button_style=''
    )
    display(drop_redundant)
else:
    print("No redundant columns found.")


In [None]:
# Drop redundant columns based on user confirmation
if redundant_columns and drop_redundant.value == 'Yes':
    data.drop(columns=redundant_columns, inplace=True)
    print(f"Dropped columns: {redundant_columns}")
else:
    print("No columns were dropped.")


---
## Solving Duplicate Rows

We will identify duplicate rows in the dataset. You will be asked to confirm before any duplicates are removed.
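
To illustrate what counts as a duplicate, here is a small sketch on hypothetical toy data. It assumes the pandas defaults, where a row is a duplicate only if every column matches an earlier row, and the first occurrence is kept.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
toy = pd.DataFrame({
    "customer": ["a", "b", "a", "a"],
    "amount": [10, 20, 10, 30],
})

# duplicated() marks each repeat of an earlier row; the first occurrence is not marked
dup_mask = toy.duplicated()
n_duplicates = int(dup_mask.sum())

# drop_duplicates() keeps the first occurrence by default
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```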


In [None]:
# Identify duplicate rows
duplicate_rows = data.duplicated().sum()

# Display duplicate rows count and ask for confirmation to drop
if duplicate_rows > 0:
    print(f"There are {duplicate_rows} duplicate rows in the dataset.")
    drop_duplicates = widgets.ToggleButtons(
        options=['Yes', 'No'],
        description='Drop Duplicates?',
        disabled=False,
        button_style=''
    )
    display(drop_duplicates)
else:
    print("No duplicate rows found.")


In [None]:
# Drop duplicate rows based on user confirmation
if duplicate_rows > 0 and drop_duplicates.value == 'Yes':
    data.drop_duplicates(inplace=True)
    print(f"Dropped {duplicate_rows} duplicate rows.")
else:
    print("No duplicate rows were dropped.")


---
## Solve Missing Values

### Identify Missing Values

In this section, we will identify the missing values in the dataset. This will help us understand the extent of missing data and decide on an appropriate action to handle it.
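
Besides absolute counts, the share of missing cells per column is often useful when deciding between dropping and replacing. The sketch below uses hypothetical toy data and is not part of the notebook's own helper function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data used only for illustration
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, np.nan],
    "city": ["Prague", "Brno", None, "Brno"],
    "id": [1, 2, 3, 4],
})

# isna().mean() gives the fraction of missing cells in each column
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```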


In [None]:
# Function to identify missing values
def getMissing(data):
    missing_cnt = data.isna().sum().sum()
    missing_pct = missing_cnt / (len(data.columns) * len(data))     
    missing_out = data.isna().sum()
    
    print('=====================================')
    print(f'Total missing cells: [{missing_cnt}]')
    print(f'Percentage of missing cells: [{missing_pct:.2%}]')
    print('=====================================')
    print('Count of missing cells per column:')
    print(missing_out)
    print('=====================================')
    print('-------------------------------------')

# Identify missing values in the dataset
getMissing(data)


### Decide How to Handle Missing Values

Choose a missing action from the following options:
- **"drop"**: Drop rows with missing values in the selected column(s).
- **"replace"**: Replace missing numeric values with the column mean and missing categorical values with a new category named "REPLACED-Undefined" for the selected columns.
- **"replaceNumeric"**: Replace missing numeric values with the column mean for the selected columns.
- **"replaceCategorical"**: Replace missing categorical values with a new category named "REPLACED-Undefined" for the selected columns.
- **"None"**: Leave missing values as they are.
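
The options above map onto a few basic pandas operations. The following is a simplified sketch on hypothetical toy data, not the notebook's actual implementation; the placeholder string matches the one used in the apply cell further down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data used only for illustration
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "city": ["Prague", None, "Brno"],
})

# "drop": remove rows that have a missing value in the chosen column(s)
dropped = toy.dropna(subset=["age"])

# "replaceNumeric": fill numeric gaps with the column mean
filled = toy.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())

# "replaceCategorical": fill categorical gaps with a placeholder category
filled["city"] = filled["city"].fillna("REPLACED-Undefined")
print(filled)
```

Note that assigning the result back (`filled["age"] = ...`) is used here instead of `inplace=True` on a selected column, which does not reliably modify the parent DataFrame in recent pandas versions.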

<h3><font color="red">↓↓↓ Execute the cell below and choose how to solve missing values ↓↓↓</font></h3>


In [None]:
# Display widgets to choose how to handle missing values
if data.isna().sum().sum() > 0:
    MISSING_ACTION = widgets.ToggleButtons(
        options=['None', 'drop', 'replace', 'replaceNumeric', 'replaceCategorical'],
        description='Action:',
        disabled=False,
        button_style='info',  # 'success', 'info', 'warning', 'danger' or ''
        value='None'
    )
    COLUMNS_ACTION = widgets.SelectMultiple(
        options=['ALL COLUMNS'] + list(data.columns),
        description='Columns:',
        ensure_option=True,
        disabled=False,
        rows=15
    )

    display(MISSING_ACTION)
    display(COLUMNS_ACTION)
else:
    print('[INFO] There are no missing values in your dataset.')

### Apply Missing Values Action

Execute the cell below to apply the chosen action for handling missing values.

<i><b>NOTE:</b> You can select and execute the missing action multiple times.</i><br>
<i>For example, you can first select 'drop' for a specific column and then 'replaceNumeric' for numeric columns you prefer not to drop.</i>


In [None]:
import ast

# Function to handle missing values based on selected action
def solveMissing(data, MISSING_ACTION):
    messageOut = []
    allColumns = list(data.columns)
    datePreds = []
    categoricalPreds = []
    numericPreds = []
    
    for predictor in allColumns:
        if data[predictor].dtype == 'object':
            try:
                pd.to_datetime(data[predictor])
                datePreds.append(predictor)
            except Exception:  # not parseable as dates, so treat as categorical
                categoricalPreds.append(predictor) 
        elif 'datetime' in str(data[predictor].dtype):
            datePreds.append(predictor)
        else:
            numericPreds.append(predictor)
    
    if 'None' in MISSING_ACTION[:4]:
        messageOut.append('Not solving any columns.')
        
    if MISSING_ACTION == "replaceAll":
        messageOut.append('Replacing missing values in all columns:')
        for col in allColumns:
            if data[col].isna().sum() > 0:
                if col in numericPreds:
                    data[col] = data[col].fillna(data[col].mean())
                else:
                    data[col] = data[col].fillna('REPLACED-Undefined')
                messageOut.append(col)
            
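    elif MISSING_ACTION.startswith("replaceNumeric"):
        messageOut.append('Replacing missing values in NUMERIC columns:')
        # Limit the action to the selected columns unless 'ALL COLUMNS' was chosen
        suffix = MISSING_ACTION[len("replaceNumeric"):]
        selected = allColumns if suffix == 'All' else ast.literal_eval(suffix)
        for col in numericPreds:
            if col in selected and data[col].isna().sum() > 0:
                data[col] = data[col].fillna(data[col].mean())
                messageOut.append(col)

    elif MISSING_ACTION.startswith("replaceCategorical"):
        messageOut.append('Replacing missing values in CATEGORICAL columns:')
        suffix = MISSING_ACTION[len("replaceCategorical"):]
        selected = allColumns if suffix == 'All' else ast.literal_eval(suffix)
        for col in categoricalPreds:
            if col in selected and data[col].isna().sum() > 0:
                data[col] = data[col].fillna('REPLACED-Undefined')
                messageOut.append(col)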
    
    elif "replace" in MISSING_ACTION:
        messageOut.append('Replacing missing values in selected columns.')
        colsToReplace = ast.literal_eval(MISSING_ACTION.replace("replace", ""))
        for col in colsToReplace:
            if col in categoricalPreds:
                if data[col].isna().sum() > 0:
                    data[col] = data[col].fillna('REPLACED-Undefined')
                    messageOut.append(col)
            else:
                if data[col].isna().sum() > 0:
                    data[col] = data[col].fillna(data[col].mean())
                    messageOut.append(col)
                        
    if MISSING_ACTION == 'dropAll':
        messageOut.append('Dropping missing values in all columns.')
        data.dropna(inplace=True)
            
    elif "drop" in MISSING_ACTION[:4]:
        messageOut.append('Dropping missing values in selected columns.')
        colsToDrop = ast.literal_eval(MISSING_ACTION.replace("drop", ""))
        data.dropna(subset=colsToDrop, inplace=True)
        messageOut.append(colsToDrop)
    
    if len(messageOut) == 0:
        messageOut.append('[INFO] There is nothing to do for selected action.')
    print(messageOut)
    return data

# Apply the chosen action for handling missing values
missing_action_value = MISSING_ACTION.value
columns_action_value = list(COLUMNS_ACTION.value)
if 'ALL COLUMNS' in columns_action_value:
    missing_action_concat = missing_action_value + 'All'
else:
    missing_action_concat = missing_action_value + str(columns_action_value)

data = solveMissing(data, missing_action_concat)


In [None]:
display(data.head())

---
## Identifying Data Types and Encoding Categorical Variables

In this section, we will identify the data types of each column and suggest methods to encode categorical variables. Encoding categorical variables is a crucial step for machine learning as most algorithms require numerical input.

### Identify Data Types

We will first identify and display the data types of each column in the dataset.


In [None]:
# Identify data types of each column
data_types = data.dtypes
print("Data types of each column:")
print(data_types)


### Suggest Encoding Methods for Categorical Variables

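The cell below treats every column of dtype `object` as categorical. Two common methods are available for encoding these variables:

- **One-Hot Encoding**: Creates a binary column for each category. The code below uses `drop_first=True`, so the first category of each variable is dropped and implied by the others.
- **Label Encoding**: Converts each category to a unique integer. The integers imply an ordering that does not exist for nominal categories, so this fits best with ordinal data or tree-based models.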

You can select the encoding method for each categorical variable from the dropdown menus below.
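
To make the difference concrete, here is a minimal sketch that applies both methods to the same hypothetical column. The column name and values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data used only for illustration
toy = pd.DataFrame({"embarked": ["S", "C", "Q", "S"]})

# One-hot encoding: one indicator column per category;
# drop_first=True omits the first category ("C") because the others imply it
one_hot = pd.get_dummies(toy, columns=["embarked"], drop_first=True)

# Label encoding: categories are sorted and mapped to integers 0..n-1
label_encoded = LabelEncoder().fit_transform(toy["embarked"])

print(list(one_hot.columns))
print(label_encoded.tolist())
```

One-hot encoding a column with many distinct values (for example names or ticket numbers) can add hundreds of columns, so such columns are usually better dropped or label encoded.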


In [None]:
# Identify categorical columns
categorical_columns = [col for col in data.columns if data[col].dtype == 'object']

# Create dropdown widgets for selecting encoding methods
encoding_methods = ['One-Hot Encoding', 'Label Encoding']
encoding_dropdowns = {col: widgets.Dropdown(options=encoding_methods, description=f'{col}:', disabled=False) for col in categorical_columns}

# Display dropdown widgets
for col, dropdown in encoding_dropdowns.items():
    display(dropdown)


### Apply Selected Encoding Methods

Based on your selections above, we will encode the categorical variables using the chosen methods.


In [None]:
# Apply selected encoding methods
from sklearn.preprocessing import LabelEncoder

for col, dropdown in encoding_dropdowns.items():
    method = dropdown.value
    if method == 'One-Hot Encoding':
        data = pd.get_dummies(data, columns=[col], drop_first=True)
    elif method == 'Label Encoding':
        le = LabelEncoder()
        data[col] = le.fit_transform(data[col])

print("Encoding applied. Here is the updated dataset:")
display(data.head())


---
## Additional Steps Before Machine Learning

To further prepare your dataset for machine learning, we will perform the following steps:

1. **Select the Target Variable**: Identify the target variable, which is the variable we aim to predict.
2. **Feature Scaling**: Normalize or standardize numerical features to ensure they have the same scale. This helps improve the performance of many machine learning algorithms that are sensitive to the scale of input data.
3. **Feature Engineering**: Create new features from existing ones to help improve model performance. This step is optional and depends on the specific dataset and problem.
4. **Dimensionality Reduction**: Reduce the number of features if you have a high-dimensional dataset, using techniques like PCA (Principal Component Analysis). This helps to reduce computational cost and can improve model performance.
5. **Splitting the Data**: Split the dataset into training and testing sets to evaluate model performance. This helps in validating how well the model generalizes to unseen data.
6. **Saving the Final Transformed Dataset**: Save the final transformed dataset to a CSV file for use in the machine learning notebook.


In [None]:
# Widget to select target column
target_column_widget = widgets.Dropdown(
    options=data.columns.tolist(),
    description='Target Column:',
    disabled=False
)
display(target_column_widget)


In [None]:
# Select target column
target_column = target_column_widget.value
print(f"Target column selected: {target_column}")

# Separate target column from features
X = data.drop(columns=[target_column])
y = data[target_column]

# Report whether the target is binary (the target itself is not modified)
if y.nunique() == 2:
    print("Target variable is binary and will remain unchanged.")
else:
    print("Target variable is not binary.")


### Feature Scaling

The cell below standardizes every numerical feature to zero mean and unit variance with `StandardScaler`. Putting features on a common scale helps algorithms that are sensitive to feature magnitude, such as distance-based and gradient-based models.
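
As a small sketch of what standardization computes, the example below compares `StandardScaler` with the manual formula `(x - mean) / std` on hypothetical values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature values used only for illustration
values = np.array([[1.0], [2.0], [3.0], [4.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(values)

# StandardScaler computes (x - mean) / std with the population standard deviation (ddof=0)
manual = (values - values.mean()) / values.std()

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

In a modeling workflow, the scaler is commonly fitted on the training split only and then applied to the test split, so that test data does not influence the scaling parameters.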


In [None]:
from sklearn.preprocessing import StandardScaler

# Feature Scaling
scaler = StandardScaler()
numerical_columns = X.select_dtypes(include=[np.number]).columns.tolist()
X[numerical_columns] = scaler.fit_transform(X[numerical_columns])
print("Feature scaling applied to numerical columns.")


### Feature Engineering (Example: Creating a new feature based on existing ones)
 - Add any feature engineering steps here if applicable. These steps are always specific to your dataset, so no code is provided.
 - A typical approach is to calculate aggregated values for every customer, such as the count of orders in the last 7, 30, and 90 days.
     - That alone gives you three new features. The number of valid features is practically unlimited, which is why feature engineering is a complex area.
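
The aggregated order counts mentioned above could be computed along these lines. This is only a sketch: the `orders` table, its `customer_id` and `order_date` columns, and the `as_of` reference date are all hypothetical and would need to be adapted to your data.

```python
import pandas as pd

# Hypothetical order log; the column names are assumptions for illustration
orders = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "order_date": pd.to_datetime(
        ["2023-12-01", "2024-01-20", "2024-01-29", "2024-01-25"]
    ),
})

# Reference date against which the look-back windows are measured
as_of = pd.Timestamp("2024-01-31")
days_ago = (as_of - orders["order_date"]).dt.days

# One row per customer, one column per look-back window
customers = pd.Index(sorted(orders["customer_id"].unique()), name="customer_id")
features = pd.DataFrame(index=customers)
for window in (7, 30, 90):
    in_window = orders[days_ago <= window]
    counts = in_window.groupby("customer_id").size()
    features[f"orders_last_{window}d"] = counts.reindex(features.index, fill_value=0)

print(features)
```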



### Dimensionality Reduction

The cell below applies PCA (Principal Component Analysis) to the numerical features and keeps the smallest number of components that together retain 95% of the variance. This is most useful for high-dimensional datasets, where it reduces computational cost and can improve model performance.
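
The following sketch shows how a fractional `n_components` behaves. It uses synthetic data (five columns generated from two underlying signals), which is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data for illustration: 5 columns generated from 2 underlying signals
rng = np.random.default_rng(0)
signals = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
features = signals @ mixing + 0.01 * rng.normal(size=(200, 5))

# A float n_components in (0, 1) keeps the fewest components whose
# cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(features)

print(reduced.shape[1])
print(pca.explained_variance_ratio_.sum())
```

Because the five columns carry only two independent signals, PCA needs at most two components here. Note that PCA cannot handle missing values, so they have to be resolved beforehand.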


In [None]:
from sklearn.decomposition import PCA

# Dimensionality Reduction (Optional)
pca = PCA(n_components=0.95)  # Retain 95% of variance
X_reduced = pca.fit_transform(X[numerical_columns])
X_pca = pd.DataFrame(X_reduced, columns=[f'PC{i+1}' for i in range(X_reduced.shape[1])])

# Combine PCA components with non-numerical columns (if any)
non_numerical_data = X.drop(columns=numerical_columns)
X_final = pd.concat([non_numerical_data.reset_index(drop=True), X_pca.reset_index(drop=True)], axis=1)
print("Dimensionality reduction applied (if applicable).")


In [None]:
# Preview the feature matrix after scaling and dimensionality reduction
display(X_final.head())

### Splitting the Data

Split the dataset into training and testing sets to evaluate the model performance. You can select the size of the test set from the options below.
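
For classification targets with imbalanced classes, a stratified split is a common variation. The sketch below uses hypothetical data and the `stratify` parameter of `train_test_split`, which the cell in this notebook does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 20% positive class
features = np.arange(100).reshape(-1, 1)
target = np.array([1] * 20 + [0] * 80)

# stratify=target keeps the class ratio the same in both subsets;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.2, random_state=42, stratify=target
)

print(len(X_tr), len(X_te), y_tr.mean(), y_te.mean())
```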


In [None]:
# Widget to select test size
test_size_widget = widgets.FloatSlider(
    value=0.2,
    min=0.1,
    max=0.5,
    step=0.1,
    description='Test Size:',
    continuous_update=False,
    orientation='horizontal'
)
display(test_size_widget)


In [None]:
from sklearn.model_selection import train_test_split

# Splitting the Data
test_size = test_size_widget.value

X_train, X_test, y_train, y_test = train_test_split(X_final, y, test_size=test_size, random_state=42)
print(f"Data split into training and testing sets with test size = {test_size}.")


### Saving the Final Transformed Dataset

We will save the final transformed dataset to a CSV file for use in the machine learning notebook.


In [None]:
# Save the final transformed dataset
final_dataset = pd.concat([X_final, y.reset_index(drop=True)], axis=1)
final_dataset.to_csv('/data/final_transformed_dataset.csv', index=False)
print("Final transformed dataset saved to /data/final_transformed_dataset.csv")

# Save the split train and test datasets
train_dataset = pd.concat([X_train.reset_index(drop=True), y_train.reset_index(drop=True)], axis=1)
test_dataset = pd.concat([X_test.reset_index(drop=True), y_test.reset_index(drop=True)], axis=1)

train_dataset.to_csv('/data/train_dataset.csv', index=False)
print("Training dataset saved to /data/train_dataset.csv")

test_dataset.to_csv('/data/test_dataset.csv', index=False)
print("Testing dataset saved to /data/test_dataset.csv")
