# Exploratory Data Analysis

This notebook explores the two provided datasets, customer reservations and hotel bookings, using PySpark.

## PySpark Setup

First, we initialize our Spark session and load the local `eda` helper module.

In [1]:
from pyspark.sql import SparkSession

import importlib

import eda

# Reload the local eda helper module so recent edits are picked up without restarting the kernel.
importlib.reload(eda)

spark = SparkSession.builder.getOrCreate()

## Customer Reservations

### Loading Data

In [2]:
customer_df = spark.read.csv("../data/customer-reservations.csv", header=True, inferSchema=True)

### Understanding Columns


In [3]:
customer_df.show(10)

+----------+-----------------------+--------------------+---------+------------+-------------+------------+-------------------+------------------+--------------+
|Booking_ID|stays_in_weekend_nights|stays_in_week_nights|lead_time|arrival_year|arrival_month|arrival_date|market_segment_type|avg_price_per_room|booking_status|
+----------+-----------------------+--------------------+---------+------------+-------------+------------+-------------------+------------------+--------------+
|  INN00001|                      1|                   2|      224|        2017|           10|           2|            Offline|              65.0|  Not_Canceled|
|  INN00002|                      2|                   3|        5|        2018|           11|           6|             Online|            106.68|  Not_Canceled|
|  INN00003|                      2|                   1|        1|        2018|            2|          28|             Online|              60.0|      Canceled|
|  INN00004|                

In [4]:
customer_df.printSchema()

root
 |-- Booking_ID: string (nullable = true)
 |-- stays_in_weekend_nights: integer (nullable = true)
 |-- stays_in_week_nights: integer (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_year: integer (nullable = true)
 |-- arrival_month: integer (nullable = true)
 |-- arrival_date: integer (nullable = true)
 |-- market_segment_type: string (nullable = true)
 |-- avg_price_per_room: double (nullable = true)
 |-- booking_status: string (nullable = true)



In [5]:
customer_df.describe().show()

+-------+----------+-----------------------+--------------------+-----------------+------------------+------------------+------------------+-------------------+------------------+--------------+
|summary|Booking_ID|stays_in_weekend_nights|stays_in_week_nights|        lead_time|      arrival_year|     arrival_month|      arrival_date|market_segment_type|avg_price_per_room|booking_status|
+-------+----------+-----------------------+--------------------+-----------------+------------------+------------------+------------------+-------------------+------------------+--------------+
|  count|     36275|                  36275|               36275|            36275|             36275|             36275|             36275|              36275|             36275|         36275|
|   mean|      NULL|      0.810723638869745|  2.2043004824259134|85.23255685733976|2017.8204272915232| 7.423652653342522|15.596995175740869|               NULL| 103.4235390764958|          NULL|
| stddev|      NULL|     

In [6]:
eda.print_num_null_per_column(customer_df)

Column                     Number of Nulls    
------------------------------------------
Booking_ID                 0                  
stays_in_weekend_nights    0                  
stays_in_week_nights       0                  
lead_time                  0                  
arrival_year               0                  
arrival_month              0                  
arrival_date               0                  
market_segment_type        0                  
avg_price_per_room         0                  
booking_status             0                  


In [7]:
eda.print_uniqe_per_column(customer_df, max_unique=20)


Column                     Unique Values and Their Frequencies (or the number of unique values for columns with more than 20 values)                                           
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Booking_ID                 36275                                                                                                                                               
stays_in_weekend_nights    {0: 16872, 1: 9995, 2: 9071, 3: 153, 4: 129, 5: 34, 6: 20, 7: 1}                                                                                    
stays_in_week_nights       {0: 2387, 1: 9488, 2: 11444, 3: 7839, 4: 2990, 5: 1614, 6: 189, 7: 113, 8: 62, 9: 34, 10: 62, 11: 17, 12: 9, 13: 5, 14: 7, 15: 10, 16: 2, 17: 3}    
lead_time                  352                                                                                              

### Graphing Data

In [16]:
customer_df.plot.hist("arrival_month", bins=12, title="Arrival Month")

In [17]:
customer_df.plot.hist("lead_time", title="Lead Time")

In [18]:
customer_df.plot.hist("avg_price_per_room", title="Average Price Per Room")

In [19]:
customer_df.plot.hist("arrival_date", title="Day of Month of Arrival")

In [20]:
customer_df.plot.hist("stays_in_week_nights", title="Stays in Week Nights")

## Hotel Booking

### Loading Data

In [31]:
hotel_df = spark.read.csv("../data/hotel-booking.csv", header=True, inferSchema=True)

### Understanding Columns


In [32]:
hotel_df.show(10)

+------------+--------------+---------+------------+-------------+------------------------+-------------------------+-----------------------+--------------------+-------------------+-------+------------------+--------------------+
|       hotel|booking_status|lead_time|arrival_year|arrival_month|arrival_date_week_number|arrival_date_day_of_month|stays_in_weekend_nights|stays_in_week_nights|market_segment_type|country|avg_price_per_room|               email|
+------------+--------------+---------+------------+-------------+------------------------+-------------------------+-----------------------+--------------------+-------------------+-------+------------------+--------------------+
|Resort Hotel|             0|      342|        2015|         July|                      27|                        1|                      0|                   0|             Direct|    PRT|               0.0|Ernest.Barnes31@o...|
|Resort Hotel|             0|      737|        2015|         July|          

In [33]:
hotel_df.printSchema()

root
 |-- hotel: string (nullable = true)
 |-- booking_status: integer (nullable = true)
 |-- lead_time: integer (nullable = true)
 |-- arrival_year: integer (nullable = true)
 |-- arrival_month: string (nullable = true)
 |-- arrival_date_week_number: integer (nullable = true)
 |-- arrival_date_day_of_month: integer (nullable = true)
 |-- stays_in_weekend_nights: integer (nullable = true)
 |-- stays_in_week_nights: integer (nullable = true)
 |-- market_segment_type: string (nullable = true)
 |-- country: string (nullable = true)
 |-- avg_price_per_room: double (nullable = true)
 |-- email: string (nullable = true)
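
Note that `arrival_month` is a string here (for example `July`), while the customer reservations schema stores it as an integer. Comparing the two datasets by month would need a common representation. The following is a minimal sketch of such a mapping in plain Python, assuming the column always holds full English month names; `MONTH_TO_NUMBER` and `month_number` are hypothetical names, not part of the `eda` module.

```python
# Full English month names in calendar order (assumed spelling of the
# values in hotel_df's arrival_month column).
MONTH_NAMES = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)

# Map each month name to its 1-based month number, e.g. "July" -> 7.
MONTH_TO_NUMBER = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}


def month_number(month_name: str) -> int:
    """Return the month number for a full month name (hypothetical helper).

    Raises KeyError for values that are not full English month names,
    which makes unexpected spellings visible instead of hiding them.
    """
    return MONTH_TO_NUMBER[month_name.strip()]
```

Raising on unknown values is a deliberate choice for exploration: any abbreviation or misspelling in the column surfaces immediately rather than turning into a silent null.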



In [34]:
hotel_df.describe().show()

+-------+------------+-------------------+------------------+------------------+-------------+------------------------+-------------------------+-----------------------+--------------------+-------------------+-------+------------------+--------------------+
|summary|       hotel|     booking_status|         lead_time|      arrival_year|arrival_month|arrival_date_week_number|arrival_date_day_of_month|stays_in_weekend_nights|stays_in_week_nights|market_segment_type|country|avg_price_per_room|               email|
+-------+------------+-------------------+------------------+------------------+-------------+------------------------+-------------------------+-----------------------+--------------------+-------------------+-------+------------------+--------------------+
|  count|       78703|              78703|             78703|             78703|        78703|                   78703|                    78703|                  78703|               78703|              78703|  78298|     

In [35]:
eda.print_num_null_per_column(hotel_df)

Column                       Number of Nulls    
--------------------------------------------
hotel                        0                  
booking_status               0                  
lead_time                    0                  
arrival_year                 0                  
arrival_month                0                  
arrival_date_week_number     0                  
arrival_date_day_of_month    0                  
stays_in_weekend_nights      0                  
stays_in_week_nights         0                  
market_segment_type          0                  
country                      405                
avg_price_per_room           0                  
email                        0                  
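
The only column with missing values is `country`. As a quick worked check, using only the counts printed above, the share of affected rows can be computed as follows; the variable names are illustrative.

```python
# Counts taken from the outputs above.
total_rows = 78703     # row count reported by describe()
country_nulls = 405    # nulls reported for the country column

# Non-null rows; this should equal the count describe() shows for country.
country_non_null = total_rows - country_nulls  # 78298

# Fraction of rows where country is missing.
null_share = country_nulls / total_rows
print(f"country is missing in {null_share:.2%} of rows")
```

At roughly half a percent of rows, the missing values are a small part of the dataset, so how they are handled is unlikely to change the overall distributions much.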


In [54]:
eda.print_uniqe_per_column(hotel_df, max_unique=20)


Column                       Unique Values and Their Frequencies (or the number of unique values for columns with more than 20 values)                                                                                               
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
hotel                        {'City Hotel': 51822, 'Resort Hotel': 26881}                                                                                                                                                            
booking_status               {0: 50224, 1: 28479}                                                                                                                                                                                    
lead_time                    439                                                    

### Graphing Data

In [40]:
hotel_df.plot.hist("arrival_date_week_number", title="Arrival Week Number")

In [42]:
hotel_df.plot.hist("lead_time", title="Lead Time")

In [47]:
hotel_df.plot.hist("avg_price_per_room", bins=100, title="Average Price Per Room")

In [48]:
hotel_df.plot.hist("arrival_date_day_of_month", title="Day of Month of Arrival")

In [51]:
hotel_df.plot.hist("stays_in_week_nights", bins=30, title="Stays in Week Nights")