# Data Analysis

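In this notebook we explore the two room datasets we have available, a reference room list and a mapping of core rooms to supplier rooms, and check whether they can be joined.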

In [1]:
# Imports
import pandas as pd
import os

In [2]:
# Go one folder back
os.chdir('..')

In [3]:
file_url_1 = "Data/referance_rooms-1737378184366.csv"

df_1 = pd.read_csv(file_url_1)
df_1

Unnamed: 0,hotel_id,lp_id,room_id,room_name
0,13484077,lp23e8ef,1142730702,Double or Twin Room
1,13487663,lp6554de34,1141927122,House
2,13462809,lp6556c3dc,1142722063,Room
3,13530116,lp6555450b,1141968275,Triple Room
4,13530071,lp6557a92c,1142513784,Apartment
...,...,...,...,...
99995,21684,lp6561b025,2168409,Two-Bedroom Suite
99996,21684,lp6561b025,2168411,Deluxe Triple Room
99997,21684,lp6561b025,2168412,Deluxe Queen Room with Two Queen Beds
99998,21684,lp6561b025,2168413,Classic Quadruple Room


### First dataset

In [4]:
df_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   hotel_id   100000 non-null  int64 
 1   lp_id      100000 non-null  object
 2   room_id    100000 non-null  int64 
 3   room_name  100000 non-null  object
dtypes: int64(2), object(2)
memory usage: 3.1+ MB


In [5]:
df_1['hotel_id'].value_counts()

hotel_id
13495031    99
3657540     71
236935      64
13898489    56
507265      53
            ..
13631599     1
13631966     1
7855856      1
13633991     1
13667015     1
Name: count, Length: 40011, dtype: int64

In [6]:
df_1['lp_id'].value_counts()

lp_id
lp65556118    99
lp6559653a    71
lp65682034    64
lp42f57       56
lp6568ac07    53
              ..
lp1868c7       1
lpf5ec5        1
lp9bfad        1
lpc8873        1
lp1da7e4       1
Name: count, Length: 40011, dtype: int64

In [7]:
df_1['room_id'].value_counts()

room_id
85638016      1
85638017      1
85638018      1
85638019      1
85638020      1
             ..
1141968275    1
1142513784    1
1141970143    1
1141970185    1
1141970194    1
Name: count, Length: 100000, dtype: int64

In [8]:
df_1['room_name'].value_counts()

room_name
One-Bedroom Apartment                                                 3635
Apartment                                                             2875
Two-Bedroom Apartment                                                 2625
Double Room                                                           2386
Holiday Home                                                          1647
                                                                      ... 
Apartment (Quadruplo)                                                    1
Standard Double Room, 1 Double Bed, Private Bathroom, Mountainside       1
Economy Double Room, Terrace, Tower                                      1
Basic Room, Women only                                                   1
Condo, 3 Bedrooms, Patio (Poplar Point Condo Unit 12F)                   1
Name: count, Length: 27936, dtype: int64

This DataFrame contains 100000 rows and 4 columns of hotel room inventory data:

- hotel_id: A numerical identifier for a hotel. There are 40011 distinct hotels, each with between 1 and 99 rooms.
- lp_id: An alphanumeric identifier for the listing page associated with a hotel. It has the same number of distinct values (40011) and the same top counts as hotel_id, which suggests a one-to-one relationship between the two.
- room_id: A unique numerical identifier for each room (100000 distinct values, one per row).
- room_name: A textual description of the room type. There are 27936 distinct room names. Generic names such as "One-Bedroom Apartment" repeat thousands of times across hotels, while many specific names occur only once.
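
The column roles described above can be checked directly rather than read off the value counts. The following is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for `df_1`); it tests that `room_id` is unique and that `hotel_id` and `lp_id` map one-to-one.

```python
import pandas as pd

# Hypothetical stand-in for df_1; the values are invented for illustration.
toy = pd.DataFrame({
    "hotel_id": [1, 1, 2, 3],
    "lp_id": ["lpa", "lpa", "lpb", "lpc"],
    "room_id": [10, 11, 12, 13],
    "room_name": ["Double Room", "Apartment", "Apartment", "House"],
})

# room_id should identify each row uniquely.
room_id_is_unique = toy["room_id"].is_unique

# hotel_id and lp_id are one-to-one if each value maps to exactly one of the other.
lp_per_hotel = toy.groupby("hotel_id")["lp_id"].nunique()
hotel_per_lp = toy.groupby("lp_id")["hotel_id"].nunique()
one_to_one = bool((lp_per_hotel == 1).all() and (hotel_per_lp == 1).all())

print("room_id unique:", room_id_is_unique)
print("hotel_id <-> lp_id one-to-one:", one_to_one)
```

Running the same checks on `df_1` would confirm or refute the one-to-one reading suggested by the matching value counts.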

In [9]:
df_1.isnull().sum()

hotel_id     0
lp_id        0
room_id      0
room_name    0
dtype: int64

There are no null values in this dataset.

### Second dataset

In [10]:
file_url_2 = "Data/updated_core_rooms.csv"

df_2 = pd.read_csv(file_url_2)
df_2

Unnamed: 0,core_room_id,core_hotel_id,lp_id,supplier_room_id,supplier_name,supplier_room_name
0,1,506732,lp7bb6c,200979491,Expedia,Superior Double Room
1,2,509236,lp7c534,200998017,Expedia,"Deluxe Room, Balcony"
2,3,516326,lp7e0e6,201144757,Expedia,Female Dormitory- 3 Beds
3,4,495330,lp78ee2,201028863,Expedia,"Standard Apartment, 2 Bedrooms (6 people)"
4,5,970167,lpecdb7,218116045,Expedia,"Traditional Cottage, 2 Bedrooms, Harbor View"
...,...,...,...,...,...,...
2869051,2912439,193359,lp2f34f,323872346,Expedia,"Deluxe Room, 1 King Bed with Sofa bed"
2869052,2912440,143473,lp23071,230770971,Expedia,Ocean Bay Pool Room
2869053,2912441,1701692958,lp656dc61e,322166812,Expedia,8 Berth Luxury Caravan
2869054,2912442,143473,lp23071,315521742,Expedia,Beach Room


In [11]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2869056 entries, 0 to 2869055
Data columns (total 6 columns):
 #   Column              Dtype 
---  ------              ----- 
 0   core_room_id        int64 
 1   core_hotel_id       int64 
 2   lp_id               object
 3   supplier_room_id    int64 
 4   supplier_name       object
 5   supplier_room_name  object
dtypes: int64(3), object(3)
memory usage: 131.3+ MB


In [12]:
df_2['core_room_id'].value_counts()

core_room_id
2912404    1
2912405    1
2912406    1
2912407    1
2912408    1
          ..
4          1
5          1
6          1
7          1
8          1
Name: count, Length: 2869056, dtype: int64

In [13]:
df_2['core_hotel_id'].value_counts()

core_hotel_id
1701587331    174
1700215476    112
1700205553    111
228186        109
116919        106
             ... 
628023          1
624782          1
1701692904      1
1701692909      1
1701692925      1
Name: count, Length: 813834, dtype: int64

In [14]:
df_2['lp_id'].value_counts()

lp_id
lp656c2983    174
lp65573ab4    112
lp655713f1    111
lp37b5a       109
lp1c8b7       106
             ... 
lp99537         1
lp9888e         1
lp656dc5e8      1
lp656dc5ed      1
lp656dc5fd      1
Name: count, Length: 813834, dtype: int64

In [15]:
df_2['supplier_room_id'].value_counts()

supplier_room_id
218174788    2
318684501    2
314208359    2
315667927    2
216048770    2
            ..
323872355    1
323872411    1
322166900    1
323872346    1
200998017    1
Name: count, Length: 2825552, dtype: int64

In [16]:
df_2['supplier_name'].value_counts()

supplier_name
Expedia    2869056
Name: count, dtype: int64

In [17]:
df_2['supplier_room_name'].value_counts()

supplier_room_name
Standard Double Room                                                     45313
Double Room                                                              43990
Apartment                                                                43158
House                                                                    36772
Deluxe Double Room                                                       34223
                                                                         ...  
Deluxe Room, 1 Queen Bed (Riviera)                                           1
Lit simple en refuge pour 4                                                  1
Suite della Serva                                                            1
Standard Tent, 1 Queen Bed                                                   1
Studio, 1 King Bed, Accessible, Balcony (Mobility & Hearing, Bathtub)        1
Name: count, Length: 704452, dtype: int64

This DataFrame has 2869056 rows and 6 columns. It maps core (internal) room and hotel identifiers to the corresponding supplier-specific details, linking standardized internal IDs with external supplier information.

- core_room_id: A unique numerical identifier for a room in the internal system (one distinct value per row). It acts as the primary key for rooms.
- core_hotel_id: A numerical identifier for a hotel in the same internal system. There are 813834 distinct hotels, each with between 1 and 174 rooms.
- lp_id: An alphanumeric identifier for a listing page. It has the same number of distinct values (813834) and the same top counts as core_hotel_id.
- supplier_room_id: The room identifier used in the supplier's own system. It is not unique here: there are 2825552 distinct values across 2869056 rows, so some supplier rooms appear twice.
- supplier_name: The name of the external supplier, which is Expedia for every row.
- supplier_room_name: The descriptive name of the room as provided by the supplier (704452 distinct values).
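
The value counts for `supplier_room_id` show fewer distinct values (2825552) than rows (2869056), so some supplier IDs are attached to more than one core room. A minimal sketch for listing those rows, using a made-up frame (`toy` and its values are hypothetical stand-ins for `df_2`):

```python
import pandas as pd

# Hypothetical stand-in for df_2; the values are invented for illustration.
toy = pd.DataFrame({
    "core_room_id": [1, 2, 3, 4],
    "supplier_room_id": [200, 201, 200, 202],
    "supplier_room_name": ["Beach Room", "House", "Beach Room", "Apartment"],
})

# keep=False flags every row of a duplicated group, not only the repeats.
dupes = toy[toy.duplicated("supplier_room_id", keep=False)]
dupes = dupes.sort_values(["supplier_room_id", "core_room_id"])

# Number of supplier IDs that occur more than once.
n_repeated_ids = dupes["supplier_room_id"].nunique()

print(dupes)
print("Repeated supplier_room_id values:", n_repeated_ids)
```

Inspecting the duplicated rows side by side helps decide whether they are true duplicates or the same supplier room mapped to different core hotels.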

In [18]:
df_2.isnull().sum()

core_room_id          0
core_hotel_id         0
lp_id                 0
supplier_room_id      0
supplier_name         0
supplier_room_name    1
dtype: int64

Only one value is null: a single row is missing its supplier_room_name.
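
Before using the room names downstream, it may help to look at the affected row and decide how to handle it. This sketch uses a made-up frame (`toy` is a hypothetical stand-in for `df_2`) and drops the row, which is only one option; filling it with a placeholder would be another.

```python
import pandas as pd

# Hypothetical stand-in for df_2; the values are invented for illustration.
toy = pd.DataFrame({
    "core_room_id": [1, 2, 3],
    "supplier_room_name": ["Superior Double Room", None, "Beach Room"],
})

# Show the row(s) with a missing room name.
missing = toy[toy["supplier_room_name"].isna()]
print(missing)

# Assumption: dropping the row is acceptable because only one value is affected.
cleaned = toy.dropna(subset=["supplier_room_name"]).reset_index(drop=True)
print(len(cleaned), "rows remain")
```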

## Data Aggregation
Let's see if we can join both datasets on 'lp_id'.

Check how many distinct ids appear in both datasets, comparing
df_1['lp_id'] and df_2['lp_id']:

In [19]:
# Convert the columns to sets
set_1 = set(df_1['lp_id'])
set_2 = set(df_2['lp_id'])

# Find the intersection of the sets
equal_ids = set_1.intersection(set_2)

# Count the number of equal IDs
count_equal_ids = len(equal_ids)

# Display the results
print("Number of equal IDs:", count_equal_ids)

Number of equal IDs: 28638


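Of the 40011 distinct 'lp_id' values in df_1, 28638 (roughly 72%) also appear in df_2. Therefore, we can join both DataFrames on the 'lp_id' column, keeping in mind that the remaining values have no match and that 'lp_id' repeats on both sides, so the join is many-to-many.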
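
A minimal sketch of such a join, using small made-up frames (`left` and `right` are hypothetical stand-ins for `df_1` and `df_2`). It first measures the key overlap with the `indicator` option of `pd.merge`, then performs an inner join; because `lp_id` repeats on both sides, every room of a listing is paired with every supplier room of the same listing.

```python
import pandas as pd

# Hypothetical stand-ins for df_1 and df_2; the values are invented.
left = pd.DataFrame({
    "lp_id": ["lpa", "lpa", "lpb"],
    "room_name": ["Double Room", "Apartment", "House"],
})
right = pd.DataFrame({
    "lp_id": ["lpa", "lpa", "lpc"],
    "supplier_room_name": ["Standard Double Room", "Apartment", "Beach Room"],
})

# Key-level overlap: which lp_id values exist on the left, the right, or both.
keys = pd.merge(
    left[["lp_id"]].drop_duplicates(),
    right[["lp_id"]].drop_duplicates(),
    on="lp_id",
    how="outer",
    indicator=True,
)
overlap = keys["_merge"].astype(str).value_counts().to_dict()
print(overlap)

# Inner join keeps only listings present in both frames (many-to-many on lp_id).
joined = left.merge(right, on="lp_id", how="inner")
print(joined)
```

On the real data the joined frame can be much larger than either input, so checking the row count after the merge is a useful sanity check.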