## Data Cleaning and Transformation

This notebook walks through practical data cleaning and transformation tasks using pandas and NumPy. It uses the `keboola.component` library to load input tables and write output tables.

### Key Steps:

1. **Load Input Data**: Use Keboola's CommonInterface to load data from input tables.
2. **Data Cleaning**: Handle missing values, remove duplicates, and correct data types.
3. **Data Transformation**: Normalize data, merge datasets, and create new features.
4. **Save Output Data**: Write the cleaned and transformed data back to Keboola Storage.


In [None]:
import pandas as pd
import numpy as np
from keboola.component import CommonInterface
import logging

# Initialize CommonInterface
ci = CommonInterface()

# Set up logging
logging.basicConfig(level=logging.INFO)

# Load input tables
input_tables = ci.get_input_tables_definitions()

# Check if input tables are available
if not input_tables:
    logging.error("No input tables found. Please configure Table Input Mapping in the workspace configuration.")
else:
    # Load the first input table into a DataFrame
    input_table_path = input_tables[0].full_path
    df = pd.read_csv(input_table_path)
    logging.info(f"Data loaded from {input_table_path}")

    # Display the first few rows of the DataFrame
    display(df.head())


### Load a Sample Dataset from a URL

In [None]:
# URL of the Titanic dataset
titanic_url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'

# Load the Titanic dataset into a pandas DataFrame
# Note: this overwrites any `df` loaded from the input mapping above
df = pd.read_csv(titanic_url)

display(df.head())


### Data Cleaning

In this section, we'll handle missing values, remove duplicates, and correct data types to ensure the data is clean and ready for analysis.

1. **Handling Missing Values**: We'll fill missing values in numeric columns with the mean of each column. For non-numeric columns, we'll fill missing values with a placeholder or drop them if necessary.
2. **Removing Duplicates**: We'll remove any duplicate rows from the DataFrame.
3. **Correcting Data Types**: We'll make sure columns have the correct data types, for example by converting date strings to datetime.
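
The three steps above can be tried on a small table first. The following is a minimal sketch, not the notebook's pipeline: the `toy` DataFrame and its column names (`age`, `city`, `joined`) are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric, one text, and one date-like column
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0, 40.0],
    "city": ["Prague", None, "Brno", "Brno"],
    "joined": ["2024-01-05", "not a date", "2024-03-01", "2024-03-01"],
})

# 1. Missing values: column mean for numeric columns, placeholder for text
numeric_cols = toy.select_dtypes(include=[np.number]).columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())
toy["city"] = toy["city"].fillna("unknown")

# 2. Duplicates: the last two rows are identical, so one of them is dropped
toy = toy.drop_duplicates().reset_index(drop=True)

# 3. Data types: unparseable strings become NaT instead of raising an error
toy["joined"] = pd.to_datetime(toy["joined"], errors="coerce")

print(toy)
print(toy.dtypes)
```

Passing `errors="coerce"` is a design choice: it keeps the conversion from failing on bad values, at the cost of silently turning them into `NaT`, so it is worth counting the `NaT` values afterwards.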


In [None]:
# Handling missing values

# Fill missing values for numeric columns with the mean of the column
df.fillna(df.select_dtypes(include=[np.number]).mean(), inplace=True)

# Fill missing values for non-numeric columns with a placeholder or drop them
# Example: Replace NaN with 'unknown' for non-numeric columns
non_numeric_cols = df.select_dtypes(exclude=[np.number]).columns
df[non_numeric_cols] = df[non_numeric_cols].fillna('unknown')

# Removing duplicates
df.drop_duplicates(inplace=True)

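# Correcting data types
# Replace 'date_column' with the actual column name in your dataset
# errors='coerce' turns unparseable values (including the 'unknown' placeholder set above) into NaT
if 'date_column' in df.columns:
    df['date_column'] = pd.to_datetime(df['date_column'], errors='coerce')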

logging.info("Data cleaning completed.")
display(df.head())


### Data Transformation

Next, we'll perform data normalization, merge datasets, and create new features to transform the data for better analysis.

1. **Data Normalization**: We'll standardize numeric columns (z-score) so that each has a mean of 0 and a standard deviation of 1.
2. **Merging Datasets**: If there are multiple tables, we'll merge them on a common key.
3. **Creating New Features**: We'll create new columns based on existing data.
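
As a rough illustration of these three steps, here is a minimal sketch on two made-up tables. The table names (`orders`, `customers`), the `customer_id` key, and the derived `above_average` column are all assumptions for this example and do not come from the notebook's data.

```python
import pandas as pd

# Hypothetical toy tables sharing a 'customer_id' key
orders = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "price": [10.0, 20.0, 30.0],
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "country": ["CZ", "DE", "US"],
})

# 1. Z-score normalization of selected columns only; the key column is
#    left untouched so it can still be used for merging
scale_cols = ["price"]
orders[scale_cols] = (
    orders[scale_cols] - orders[scale_cols].mean()
) / orders[scale_cols].std()

# 2. Left merge keeps every order; validate= raises an error if
#    'customer_id' is duplicated on the customers side
merged = orders.merge(
    customers, on="customer_id", how="left", validate="many_to_one"
)

# 3. New feature derived from an existing column
merged["above_average"] = merged["price"] > 0

print(merged)
```

Customer 3 has no match in `customers`, so its `country` is missing after the left merge. An inner merge (the `pd.merge` default) would drop that row instead.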


In [None]:
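# Data normalization
# Standardize numeric columns to mean 0 and standard deviation 1 (z-score)
# Note: this also rescales identifier and label columns (e.g. PassengerId and Survived
# in the Titanic data); drop them from numeric_cols if they must stay unchanged
numeric_cols = df.select_dtypes(include=[np.number]).columns
col_std = df[numeric_cols].std().replace(0, 1)  # avoid division by zero for constant columns
df[numeric_cols] = (df[numeric_cols] - df[numeric_cols].mean()) / col_std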

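# Merging datasets
# If a second input table is configured, merge it with df on a shared key column
# Note: df_merged is only displayed here; assign it to df if the merged result should be saved
if len(input_tables) > 1:
    second_table_path = input_tables[1].full_path
    df2 = pd.read_csv(second_table_path)

    # Replace 'common_key' with the actual column name used for merging
    if 'common_key' in df.columns and 'common_key' in df2.columns:
        df_merged = pd.merge(df, df2, on='common_key')
        logging.info(f"Data merged from {second_table_path}")
        display(df_merged.head())
    else:
        logging.warning("Merge skipped: 'common_key' is not present in both tables.")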

# Creating new features
# Replace 'existing_column1' and 'existing_column2' with the actual column names in your dataset
if 'existing_column1' in df.columns and 'existing_column2' in df.columns:
    df['new_feature'] = df['existing_column1'] + df['existing_column2']

logging.info("Data transformation completed.")
display(df.head())


### Save Output Data

Finally, we'll write the cleaned and transformed data back to Keboola Storage.
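
Before writing, it can help to confirm that the column intended as the primary key is actually usable as one. The sketch below uses a hypothetical helper, `check_primary_key`, written for this example only. It is not part of the Keboola library, and the `toy` table is made up.

```python
import pandas as pd

def check_primary_key(frame: pd.DataFrame, key_cols: list) -> list:
    """Return a list of problems that make key_cols unusable as a primary key."""
    problems = []
    missing = [c for c in key_cols if c not in frame.columns]
    if missing:
        # Without the columns, the remaining checks cannot run
        problems.append(f"missing columns: {missing}")
        return problems
    if frame[key_cols].isna().any().any():
        problems.append("key contains missing values")
    if frame.duplicated(subset=key_cols).any():
        problems.append("key values are not unique")
    return problems

# Hypothetical toy table with a duplicated id
toy = pd.DataFrame({"id": [1, 2, 2], "value": ["a", "b", "c"]})

problems_id = check_primary_key(toy, ["id"])
problems_missing = check_primary_key(toy, ["order_id"])
problems_ok = check_primary_key(toy.drop_duplicates(subset=["id"]), ["id"])

print(problems_id)
print(problems_missing)
print(problems_ok)
```

An empty list means the check found nothing wrong; any other result is worth resolving before the table is written.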


In [None]:
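# Define the output table
# Replace 'id' with the primary key column(s) of your dataset; the Titanic data has no 'id' column
output_table_name = "cleaned_transformed_data"
output_table_def = ci.create_out_table_definition(f"{output_table_name}.csv", primary_key=['id'], incremental=False, destination=f'out.c-output.{output_table_name}')
output_table_path = output_table_def.full_path

# Save the processed DataFrame to the output path
df.to_csv(output_table_path, index=False)

# Write the manifest so the primary key and destination are stored with the file
ci.write_manifest(output_table_def)
logging.info(f"Processed data saved to {output_table_path}")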
