# Hands-on: Data Preparation

## Overview

In this hands-on activity, you will import a dataset (a CSV file) into a notebook and prepare it for use. The objective is to improve data quality by pre-processing the raw data before it is used for analytics or as training data for an ML model.

You will learn how to:
1. Process null columns
2. Process duplicated rows
3. Process outliers
4. Derive new columns
5. Save the cleansed data as a new CSV file

Sample data: https://ibm.box.com/v/hotel-bookings-sample-dataset

Original data source: 
https://www.sciencedirect.com/science/article/pii/S2352340918315191
https://www.kaggle.com/datasets/jessemostipak/hotel-booking-demand

## Setup

In [None]:
# Import library
import pandas as pd
import numpy
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter("ignore")
pd.set_option('display.max_columns', None)

## Load Data
- Replace this part with your own code. To insert a code snippet for data ingestion, click the '</>' icon in the top-right menu.

In [None]:
# Replace this part with your own code
import os, types
import pandas as pd
from botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

# @hidden_cell
# The following code accesses a file in your IBM Cloud Object Storage. It includes your credentials.
# You might want to remove those credentials before you share the notebook.

cos_client = ibm_boto3.client(service_name='s3',
    ibm_api_key_id='<my_api_key>',
    ibm_auth_endpoint="https://iam.cloud.ibm.com/oidc/token",
    config=Config(signature_version='oauth'),
    endpoint_url='https://s3.private.us-south.cloud-object-storage.appdomain.cloud')

bucket = 'mlpredictivemodel-donotdelete-pr-se3ulnjuojrkgg'
object_key = 'hotel_bookings.csv'

body = cos_client.get_object(Bucket=bucket,Key=object_key)['Body']
# add missing __iter__ method, so pandas accepts body as file-like object
if not hasattr(body, "__iter__"): body.__iter__ = types.MethodType( __iter__, body )

df = pd.read_csv(body)
df.head(10)

In [None]:
# Show number of columns and rows
df.shape

In [None]:
# Show all column names
df.columns

In [None]:
# Show reservation status unique values
df['reservation_status'].unique()

## 1. Process Null Columns

In [None]:
# Check null
df.isna().sum()

In [None]:
# Check columns with null
features = ['children', 'country', 'agent', 'company']

for feat in features:
    perc = len(df[df[feat].isna()])/len(df)*100
    perc = round(perc, 1)
    print(f'Null in {feat}:', perc, '%')
    print(df[feat].describe(), '\n')
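
The per-column loop above can also be written as a single vectorized expression. The following is a minimal sketch on a made-up toy frame (the name `toy` and its values are assumptions for illustration, not part of the hotel bookings data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the hotel bookings data (values are made up)
toy = pd.DataFrame({
    'children': [0.0, 1.0, np.nan, 0.0],
    'country': ['PRT', None, 'GBR', 'PRT'],
    'company': [np.nan, np.nan, np.nan, 40.0],
    'adults': [2, 2, 1, 2],
})

# isna() gives a boolean frame; its column mean is the share of nulls
null_pct = toy.isna().mean().mul(100).round(1)

# Keep only columns that actually contain nulls, highest share first
null_pct = null_pct[null_pct > 0].sort_values(ascending=False)
print(null_pct)
```

Running the same expression on `df` lists every column with nulls, so the `features` list does not have to be maintained by hand.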

In [None]:
# For 'children', impute the nulls with the most frequent value (mode)
if df['children'].notna().any():
    mode = df['children'].mode()[0]
    # Assign the result back; calling fillna(inplace=True) on a column selection is deprecated
    df['children'] = df['children'].fillna(mode)

# For 'country', impute the nulls with the most frequent value (mode)
if df['country'].notna().any():
    mode = df['country'].mode()[0]
    df['country'] = df['country'].fillna(mode)

# For 'agent', impute the nulls with the most frequent value (mode)
if df['agent'].notna().any():
    mode = df['agent'].mode()[0]
    df['agent'] = df['agent'].fillna(mode)

# For 'company', drop the column due to the large number of nulls
df.drop('company', axis=1, inplace=True)

## 2. Process Duplicated Rows

In [None]:
# Check duplicated data
df[df.duplicated()]
# -> Since there is no unique reservation ID and the number of duplicated rows is significant, keep the data.
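
To judge whether the number of duplicated rows is significant, it can help to count them and express the count as a share of the data. This is a minimal sketch on a made-up toy frame (`toy` and its values are assumptions for illustration); the `drop_duplicates` call only shows what removal would look like if the duplicates were known to be errors, which is not the decision taken in this notebook:

```python
import pandas as pd

# Toy frame: rows 0, 1 and 3 are identical (values are made up)
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'City Hotel', 'Resort Hotel', 'City Hotel'],
    'lead_time': [10, 10, 5, 10],
    'adults': [2, 2, 1, 2],
})

# duplicated() flags every row that repeats an earlier row
n_dupes = int(toy.duplicated().sum())
dupe_pct = round(n_dupes / len(toy) * 100, 1)
print(f'{n_dupes} duplicated rows ({dupe_pct} %)')

# Only if duplicates were confirmed to be errors: keep the first occurrence
deduped = toy.drop_duplicates()
```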

## 3. Process Outliers

In [None]:
# Check outliers in original data
df_describe = pd.DataFrame(df.describe(include='all'))
df_describe

In [None]:
# Process outliers in original data

# Replace outlier values (greater than 4) with 0
df.loc[df['adults']>4, 'adults'] = 0
df.loc[df['children']>4, 'children'] = 0
df.loc[df['babies']>4, 'babies'] = 0

# 'meal' contains the value "Undefined", which is equivalent to "SC"
# Assign the result back; calling replace(inplace=True) on a column selection is deprecated
df['meal'] = df['meal'].replace('Undefined', 'SC')
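
Before overwriting values, it is worth checking how many rows the cut-off actually touches. The sketch below counts them per column on a made-up toy frame; `toy`, `threshold`, and `affected` are hypothetical names, and the cut-off of 4 is simply the one used in the cell above:

```python
import pandas as pd

# Toy frame with a few implausible guest counts (values are made up)
toy = pd.DataFrame({
    'adults': [2, 1, 26, 2, 55],
    'children': [0, 10, 0, 1, 0],
    'babies': [0, 0, 0, 9, 0],
})

threshold = 4  # same cut-off as in the cell above
guest_cols = ['adults', 'children', 'babies']

# Number of rows per column whose value exceeds the cut-off
affected = {col: int((toy[col] > threshold).sum()) for col in guest_cols}
print(affected)

# Apply the same replacement rule as the notebook
for col in guest_cols:
    toy.loc[toy[col] > threshold, col] = 0
```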

## 4. Derive New Columns

In [None]:
# Derive new columns

# Create 'total_stay_nights'
df['total_stay_nights'] = df['stays_in_week_nights'] + df['stays_in_weekend_nights']

# Create 'kids' and 'num_pax'
df['kids'] = df['children'] + df['babies']
df['num_pax'] = df['adults'] + df['kids']

In [None]:
# Check outliers in derived data
df_der = pd.DataFrame(df[['total_stay_nights', 'num_pax']].describe())
df_der

In [None]:
print(df[df['num_pax']==0].shape)
# Drop the rows if 'num_pax' == 0
df = df[df['num_pax']!=0]
df.shape
# -> Remove these rows since they contain no guest data and their number is not significant
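
The derive-then-filter step can be tried on a small example first. This is a minimal sketch on a made-up toy frame (`toy`, `n_empty`, and `cleaned` are hypothetical names); it adds `.copy()` after filtering, which is an optional precaution so that later column assignments act on an independent frame:

```python
import pandas as pd

# Toy frame: the second row has no guests at all (values are made up)
toy = pd.DataFrame({
    'adults': [2, 0, 1],
    'children': [1, 0, 0],
    'babies': [0, 0, 0],
})

# Same derived columns as in the notebook
toy['kids'] = toy['children'] + toy['babies']
toy['num_pax'] = toy['adults'] + toy['kids']

# Count the rows without guests, then keep only the others
n_empty = int((toy['num_pax'] == 0).sum())
cleaned = toy[toy['num_pax'] != 0].copy()
print(n_empty, cleaned.shape)
```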

## 5. Save Cleansed Data as New CSV

In [None]:
# The project token is an authorization token used to access project resources such as data sources and connections, and it is used by the platform API
from project_lib import Project

project = Project(None, '<my_project_id>', '<my_project_token>')
pc = project.project_context

# Show Project, Bucket and Assets
print('Project Name: {0}'.format(project.get_name()))
print('Project Description: {0}'.format(project.get_description()))
print('Project Bucket Name: {0}'.format(project.get_project_bucket_name()))
print('Project Assets (Connections): {0}'.format(project.get_assets(asset_type='connection')))

# Save dataframe as csv file in your bucket 
project.save_data(data=df.to_csv(index=False), file_name='hotel_bookings_v1.csv', overwrite=True)
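
A quick way to confirm that the CSV text passed to `save_data` is what you expect is to read it back into a DataFrame. The sketch below does this in memory on a made-up toy frame (`toy`, `csv_text`, and `restored` are hypothetical names). Outside of a project environment, `toy.to_csv('hotel_bookings_v1.csv', index=False)` would write the same content to a local file instead.

```python
import io
import pandas as pd

# Toy frame standing in for the cleansed data (values are made up)
toy = pd.DataFrame({
    'hotel': ['City Hotel', 'Resort Hotel'],
    'num_pax': [2, 3],
})

# Same serialization call as in the cell above; index=False omits the row index
csv_text = toy.to_csv(index=False)

# Read the text back to check that columns and values survive the round trip
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)
```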

## Summary 

In this hands-on activity, you have covered the following:

1. Checked the quality of the data.
2. Conducted data wrangling to ensure datasets were of acceptable quality for use in exploratory data analysis (EDA) and Machine Learning (ML) model development.
3. Saved the cleansed dataset into a new CSV file.