<center>
  <a href="MLSD-02-DataPreprocessing-C.ipynb" target="_self">Data Preprocessing C</a> | <a href="./">Content Page</a> | <a href="MLSD-02-DataPreprocessing-E.ipynb">Data Preprocessing E</a> | <a href="MLSD-02-DataPreprocessing-Ex-1.ipynb">Data Preprocessing Exercise</a>
</center>

# <center>DATA PREPROCESSING D</center>

<center><b>Copyright &copy; 2023 by DR DANNY POO</b><br> e:dannypoo@nus.edu.sg<br> w:drdannypoo.com</center><br>

# Data Preprocessing
<b>Dataset</b>: Hotel data set.<br>
<b>Tasks</b>: 
- To read in the data set with pandas.
- To convert dates into numerical values.
- To fill missing data with some values. 
- To fill missing data with mean values. 
- To drop columns that are not useful.

![image.png](attachment:image.png)

## Read in and Explore Data Set

In [None]:
# Import libraries
import numpy as np
import pandas as pd

In [None]:
# Read in data
df = pd.read_csv('./data/hotel/hotelData.csv')
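
As a side note, `pd.read_csv` can parse date columns while reading. The sketch below uses a made-up three-row sample held in memory (the column names follow this notebook, the values are hypothetical) and a separate variable `sample_df`, so it does not touch `df`.

```python
import io
import pandas as pd

# Hypothetical sample rows; only the column names come from the hotel data set.
csv_text = """\
date_time,srch_ci,srch_co,orig_destination_distance,hotel_cluster
2014-08-11 07:46:59,2014-08-27,2014-08-31,2234.26,1
2014-08-11 08:22:12,,,913.19,1
2014-08-09 18:05:16,2014-08-29,2014-09-02,,21
"""

# parse_dates converts the listed columns to datetime while reading;
# empty fields become NaT (dates) or NaN (numbers).
sample_df = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['date_time', 'srch_ci', 'srch_co'],
)
print(sample_df.dtypes)
print(sample_df.isnull().sum())
```

Parsing at read time is optional; converting later with `pd.to_datetime` gives the same columns.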

In [None]:
# Print first 5 rows
df.head()

In [None]:
# Print the column names
df.columns

In [None]:
# Print information about dataframe
df.info()

**Observations**:
- There are NaN values in orig_destination_distance, srch_ci, and srch_co columns.

In [None]:
# Check the number and fraction of NaN values per column
# (the 'Percent' column holds a fraction between 0 and 1)
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum() / len(df)).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(20)
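
The same kind of summary can be sketched on a small made-up frame to show what the two columns contain. The data and the names `toy` and `summary` are hypothetical; here `Percent` is scaled to the 0-100 range.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries.
toy = pd.DataFrame({
    'orig_destination_distance': [2234.26, np.nan, np.nan, 913.19],
    'srch_ci': ['2014-08-27', None, '2014-08-29', '2014-08-29'],
    'hotel_cluster': [1, 1, 21, 92],
})

# isnull().sum() counts NaN per column; isnull().mean() is the NaN fraction.
summary = pd.DataFrame({
    'Total': toy.isnull().sum(),
    'Percent': toy.isnull().mean() * 100,
}).sort_values('Total', ascending=False)
print(summary)
```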

## Convert Date into Numerical Values

**Convert the following into numerical values which will be relevant to the model**:
- date_time
- srch_ci
- srch_co

**Add additional features**<br>
Extract relevant information from the date columns and produce additional features:
- stay_dur: duration of the stay in days (check-out minus check-in)
- no_of_days_bet_booking: number of days between the booking and check-in
- Cin_day: Check-in day
- Cin_month: Check-in month
- Cin_year: Check-in year
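
The features listed above can be illustrated on a tiny made-up frame. This is a minimal sketch: the values and the name `toy_dates` are hypothetical, and the `.dt` accessor is used for the day arithmetic. The second row has no check-in or check-out date, so its derived features come out as NaN.

```python
import pandas as pd

# Hypothetical bookings; the second one has missing stay dates.
toy_dates = pd.DataFrame({
    'date_time': ['2014-08-11 07:46:59', '2014-08-09 18:05:16'],
    'srch_ci': ['2014-08-27', None],
    'srch_co': ['2014-08-31', None],
})
for col in ['date_time', 'srch_ci', 'srch_co']:
    toy_dates[col] = pd.to_datetime(toy_dates[col])

# Differences of datetime columns are timedeltas; .dt.days keeps whole days.
toy_dates['stay_dur'] = (toy_dates['srch_co'] - toy_dates['srch_ci']).dt.days
toy_dates['no_of_days_bet_booking'] = (
    toy_dates['srch_ci'] - toy_dates['date_time']
).dt.days

# Calendar parts of the check-in date.
toy_dates['Cin_day'] = toy_dates['srch_ci'].dt.day
toy_dates['Cin_month'] = toy_dates['srch_ci'].dt.month
toy_dates['Cin_year'] = toy_dates['srch_ci'].dt.year
print(toy_dates)
```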

In [None]:
# Function to convert date object into relevant attributes
def convert_date_into_days(dataframe):
    dataframe['srch_ci'] = pd.to_datetime(dataframe['srch_ci'])
    dataframe['srch_co'] = pd.to_datetime(dataframe['srch_co'])
    dataframe['date_time'] = pd.to_datetime(dataframe['date_time'])
    
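    # .dt.days returns whole days and works on current pandas versions,
    # where astype('timedelta64[D]') is no longer supported; missing dates stay NaN
    dataframe['stay_dur'] = (dataframe['srch_co'] - dataframe['srch_ci']).dt.days
    dataframe['no_of_days_bet_booking'] = (dataframe['srch_ci'] - dataframe['date_time']).dt.days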
    
    # Day, month, and year of the hotel check-in date
    dataframe['Cin_day'] = dataframe['srch_ci'].dt.day
    dataframe['Cin_month'] = dataframe['srch_ci'].dt.month
    dataframe['Cin_year'] = dataframe['srch_ci'].dt.year

In [None]:
# Call function convert_date_into_days()
convert_date_into_days(df)

In [None]:
# Print information about dataframe
df.info()

**Observations**:
- There are NaN values in orig_destination_distance, Cin_year, Cin_month, Cin_day, no_of_days_bet_booking, stay_dur, srch_ci, srch_co columns.

In [None]:
# Check the number and fraction of NaN values per column
# (the 'Percent' column holds a fraction between 0 and 1)
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum() / len(df)).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data

## Fill Missing Data with Some Values

In [None]:
# Fill NaN values in the derived date columns with fixed values
df['Cin_day'] = df['Cin_day'].fillna(26.0)
df['Cin_month'] = df['Cin_month'].fillna(8.0)
df['Cin_year'] = df['Cin_year'].fillna(2014.0)
df['stay_dur'] = df['stay_dur'].fillna(1.0)
df['no_of_days_bet_booking'] = df['no_of_days_bet_booking'].fillna(0.0)
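
The notebook does not say how the fixed fill values were chosen. One possible way to derive such constants, shown here purely as an assumption, is to take the most frequent value (mode) of each column. The frame `toy` and the names `fill_values` and `toy_filled` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical derived columns with one incomplete row.
toy = pd.DataFrame({
    'Cin_day': [26.0, 26.0, 3.0, np.nan],
    'Cin_month': [8.0, 8.0, 12.0, np.nan],
    'stay_dur': [1.0, 2.0, 1.0, np.nan],
})

# mode() skips NaN by default; iloc[0] takes the first mode if there are ties.
fill_values = {col: toy[col].mode().iloc[0] for col in toy.columns}

# A dict passed to fillna fills each column with its own value.
toy_filled = toy.fillna(value=fill_values)
print(fill_values)
print(toy_filled)
```

Passing a dict to `fillna` keeps all fill values in one place, which makes them easier to review than one assignment per column.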

## Fill Missing Data with Mean Values

In [None]:
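# Replace NaN in orig_destination_distance with the column mean
# (assign the result back; inplace=True on a selected column is unreliable in recent pandas)
df['orig_destination_distance'] = df['orig_destination_distance'].fillna(df['orig_destination_distance'].mean())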
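
A small worked example of mean imputation, on made-up numbers, may help show what happens to the missing entry. The frame `toy` and the variable `mean_distance` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical distances with one missing value.
toy = pd.DataFrame({'orig_destination_distance': [100.0, np.nan, 300.0]})

# mean() skips NaN, so the average is taken over the two known values.
mean_distance = toy['orig_destination_distance'].mean()

# Assign the filled column back to the frame.
toy['orig_destination_distance'] = toy['orig_destination_distance'].fillna(mean_distance)
print(mean_distance)
print(toy)
```

Mean imputation leaves the column mean unchanged but reduces its variance, which is worth keeping in mind when a large share of the column is missing.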

In [None]:
df.head()

## Drop Columns that are not Useful

**To drop**:
- date_time
- srch_ci
- srch_co
- user_id
- srch_destination_type_id
- srch_destination_id

In [None]:
# Remove the datetime columns and other columns that are not useful
# Keep a copy of user_id in case it is needed later
user_id = df['user_id']
columns = ['date_time', 'srch_ci', 'srch_co', 'user_id', 'srch_destination_type_id', 'srch_destination_id']
df.drop(columns=columns, inplace=True)
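
After dropping columns, it can be useful to check that only numerical columns remain. The sketch below does this on a made-up frame; `toy`, `to_drop`, `toy_reduced`, and `non_numeric` are hypothetical names, and `errors='ignore'` is an optional choice that skips labels not present in the frame.

```python
import pandas as pd

# Hypothetical frame mixing a datetime column, an id, and numeric features.
toy = pd.DataFrame({
    'date_time': pd.to_datetime(['2014-08-11 07:46:59', '2014-08-09 18:05:16']),
    'user_id': [12, 93],
    'stay_dur': [4.0, 1.0],
    'hotel_cluster': [1, 21],
})

# 'srch_ci' is deliberately absent from toy; errors='ignore' skips it.
to_drop = ['date_time', 'user_id', 'srch_ci']
toy_reduced = toy.drop(columns=to_drop, errors='ignore')

# Any column listed here would still need converting before modelling.
non_numeric = toy_reduced.select_dtypes(exclude='number').columns.tolist()
print(toy_reduced.columns.tolist())
print(non_numeric)
```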

In [None]:
# Print information about dataframe
df.info()

**Observations**:
- The data set is now preprocessed and ready to be passed to a model.
- All remaining columns hold numerical values: the date columns have been converted into numerical features and the original object columns dropped.
