# Performance Profiling for Hotel Review Analytics

This notebook focuses on **performance profiling** for the hotel analytics platform that HospitalityTech Solutions provides to its hotel clients. We work with the complete review dataset so that measurements reflect real-world data scale, and we demonstrate how both **Python code** and **SQLite queries** behave under load.

Our business goal in this notebook is to:
- Ensure that core analytics operations (filtering, aggregations, joins, clustering) run fast enough for interactive dashboards.
- Quantify performance gains from optimizations (in-memory filtering vs full data, database indexing, and code-level profiling).


In [1]:
import pandas as pd
import numpy as np
import sqlite3
import time
import cProfile
import pstats
import io
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

In [2]:
review_df = pd.read_csv('data/reviews.csv')
author_df = pd.read_csv('data/authors.csv')
hotel_df = pd.read_csv('data/hotels.csv')

In [3]:
cleaned_review_df = review_df.dropna().drop_duplicates(subset=['text']).copy()

## Baseline Dataset Overview and Memory Footprint

We first inspect the size and structure of the review, hotel, and author tables:

- **Review table shape**: total number of reviews and columns after basic cleaning.  
- **Hotel table shape**: number of hotels we can analyze and benchmark.  
- **Author table shape**: number of unique reviewers, enabling analysis of review helpfulness and reviewer behavior.  

We also compute **deep memory usage** for `cleaned_review_df` to understand how much RAM our analytics pipeline consumes.

In [4]:
print("REVIEW TABLE SHAPE:", review_df.shape)
print("REVIEW (CLEANED) TABLE SHAPE:", cleaned_review_df.shape)
print("HOTEL TABLE SHAPE:", hotel_df.shape)
print("AUTHOR TABLE SHAPE:", author_df.shape)

print("\nMemory Usage (Cleaned Review Dataset)")
cleaned_review_df.info(memory_usage='deep')

REVIEW TABLE SHAPE: (754798, 16)
REVIEW (CLEANED) TABLE SHAPE: (343758, 16)
HOTEL TABLE SHAPE: (3888, 1)
AUTHOR TABLE SHAPE: (524023, 7)

Memory Usage (Cleaned Review Dataset)
<class 'pandas.DataFrame'>
Index: 343758 entries, 0 to 754789
Data columns (total 16 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   id                        343758 non-null  int64  
 1   author_id                 343758 non-null  str    
 2   offering_id               343758 non-null  int64  
 3   overall                   343758 non-null  float64
 4   service                   343758 non-null  float64
 5   cleanliness               343758 non-null  float64
 6   value                     343758 non-null  float64
 7   location_rating           343758 non-null  float64
 8   sleep_quality             343758 non-null  float64
 9   rooms                     343758 non-null  float64
 10  title                     343758 non-null  str    
 11

## Prioritizing High-Quality Reviews (80th Percentile Filter)

To improve both **business relevance** and **performance**, we filter reviews based on `author_num_helpful_votes`.

- We compute the **80th percentile** threshold of `author_num_helpful_votes` across the cleaned reviews.  
- We keep only reviews whose author's helpful-vote count is at or above this threshold.  

In [5]:
start = time.time()

threshold_80 = np.percentile(
    cleaned_review_df['author_num_helpful_votes'].dropna(),
    80
)

filtered_reviews = cleaned_review_df[
    cleaned_review_df['author_num_helpful_votes'] >= threshold_80
].copy()

end = time.time()
filter_time = end - start

print("80th Percentile Threshold:", threshold_80)
print("Rows Before:", cleaned_review_df.shape[0])
print("Rows After:", filtered_reviews.shape[0])
print("Filtering Time:", round(filter_time, 4), "sec")

80th Percentile Threshold: 22.0
Rows Before: 343758
Rows After: 71102
Filtering Time: 0.0253 sec
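
A small worked example may help make the threshold logic concrete. The vote counts below are made up for illustration and are not taken from the dataset. `np.percentile` interpolates linearly by default, and the `>=` comparison keeps every value tied with the threshold, so the retained share can exceed 20% when many authors share the same vote count.

```python
import numpy as np

# Hypothetical helpful-vote counts (illustrative only, not from the dataset).
votes = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34])

# Default linear interpolation: position 0.8 * (10 - 1) = 7.2,
# i.e. 13 + 0.2 * (21 - 13) = 14.6.
threshold = np.percentile(votes, 80)
kept = votes[votes >= threshold]
kept_share = kept.size / votes.size

# With many ties at the threshold, ">=" keeps more than 20% of the rows.
tied_votes = np.array([0, 0, 0, 0, 0, 5, 5, 5, 5, 5])
tied_threshold = np.percentile(tied_votes, 80)
tied_share = (tied_votes >= tied_threshold).mean()

print("threshold:", threshold, "kept share:", kept_share)
print("tied threshold:", tied_threshold, "tied kept share:", tied_share)
```

In the run above, 71,102 of the 343,758 cleaned reviews (about 20.7%) pass the filter, slightly more than 20%, which would be consistent with ties at the threshold of 22.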


## Measuring Core Analytics Operations (Full vs Filtered Data)

Two operations are central to the hotel analytics platform:

1. **Aggregation**: computing average ratings per hotel (e.g., `overall` score grouped by `offering_id`).  
2. **Join**: merging review data with hotel master data so each hotel’s metrics appear with its attributes (name, location, etc.).  

We measure how long these operations take on:  
- The **full dataset** (the raw `review_df`, before cleaning).  
- The **filtered dataset** (`filtered_reviews`: cleaned reviews whose authors are at or above the 80th percentile of helpful votes).  


In [6]:
start = time.time()

agg_full = review_df.groupby("offering_id")["overall"].mean()

end = time.time()
agg_full_time = end - start

print("Aggregation Time (Full Data):", round(agg_full_time, 4), "sec")

Aggregation Time (Full Data): 0.0189 sec


In [7]:
start = time.time()

merged_full = review_df.merge(
    hotel_df,
    on="offering_id",
    how="left"
)

end = time.time()
join_full_time = end - start

print("Join Time (Full Data):", round(join_full_time, 4), "sec")

Join Time (Full Data): 0.0324 sec


In [8]:
start = time.time()

agg_filtered = filtered_reviews.groupby("offering_id")["overall"].mean()

end = time.time()
agg_filtered_time = end - start

print("Aggregation Time (Filtered Data):", round(agg_filtered_time, 4), "sec")

Aggregation Time (Filtered Data): 0.0026 sec


In [9]:
start = time.time()

merged_filtered = filtered_reviews.merge(
    hotel_df,
    on="offering_id",
    how="left"
)

end = time.time()
join_filtered_time = end - start

print("Join Time (Filtered Data):", round(join_filtered_time, 4), "sec")

Join Time (Filtered Data): 0.0056 sec


In [10]:
performance_summary = pd.DataFrame({
    "Operation": [
        "Aggregation",
        "Join"
    ],
    "Full Data Time (sec)": [
        round(agg_full_time,4),
        round(join_full_time,4)
    ],
    "Filtered Data Time (sec)": [
        round(agg_filtered_time,4),
        round(join_filtered_time,4)
    ]
})

performance_summary

     Operation  Full Data Time (sec)  Filtered Data Time (sec)
0  Aggregation                0.0189                    0.0026
1         Join                0.0324                    0.0056


### Interpretation: Performance Gains from Filtering

The summary table reports timing (in seconds) for aggregation and join operations on both full and filtered data.

Pattern observed in this run:

- **Aggregation** time drops from 0.0189 s to 0.0026 s (roughly 7x faster).  
- **Join** time drops from 0.0324 s to 0.0056 s (roughly 6x faster).  

**Business impact:**  
If a dashboard repeatedly computes these metrics (e.g., when the user changes filters), the reduction in runtime translates into:

- **Faster response times** for hotel managers.  
- Ability to scale to more concurrent users or more complex analytics without upgrading hardware.  

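This shows that focusing on high-helpfulness reviews reduces runtime while keeping the reviews most likely to be informative. Two caveats apply: the comparison is against the raw `review_df`, so part of the gain comes from the cleaning step rather than the helpfulness filter alone, and single-run timings at the millisecond scale are noisy.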
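
Because each operation above is timed once with `time.time()`, the figures can shift noticeably between runs. A minimal sketch of a more robust measurement follows; the helper name `time_call` and the stand-in workload are hypothetical and not part of the original notebook.

```python
import statistics
import time


def time_call(fn, repeats=7):
    """Run fn() several times and return (best, median) wall-clock seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        durations.append(time.perf_counter() - start)
    return min(durations), statistics.median(durations)


# Stand-in workload; in the notebook this could be, for example,
# lambda: filtered_reviews.groupby("offering_id")["overall"].mean()
best, median = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"best={best:.4f}s median={median:.4f}s")
```

Reporting the best and median of several repeats makes it easier to tell a real speedup from run-to-run variation.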


## Moving to SQLite for Production-Like Workloads

To align with the assignment requirement of using a **SQLite database** with 50K–80K reviews, we persist the filtered review data (71,102 rows), the hotel data, and the author data into an on-disk database (`performance_testing.db`).

In [11]:
conn = sqlite3.connect("data/performance_testing.db")

filtered_reviews.to_sql("reviews", conn, if_exists="replace", index=False)
hotel_df.to_sql("hotels", conn, if_exists="replace", index=False)
author_df.to_sql("authors", conn, if_exists="replace", index=False)

524023

## Query-Level Profiling: Average Ratings per Hotel

We profile a key SQL query that computes **average aspect ratings per hotel**.

In [12]:
aggregation_query = """
SELECT offering_id,
       AVG(service) AS avg_service,
       AVG(cleanliness) AS avg_cleanliness,
       AVG(value) AS avg_value
FROM reviews
GROUP BY offering_id
"""

start = time.time()
baseline_result = pd.read_sql(aggregation_query, conn)
end = time.time()

print("Aggregation Time (Before Index):", round(end - start, 4), "seconds")

Aggregation Time (Before Index): 0.1077 seconds


In [13]:
explain_query = "EXPLAIN QUERY PLAN " + aggregation_query
print(pd.read_sql(explain_query, conn))

   id  parent  notused                        detail
0   6       0      216                  SCAN reviews
1   8       0        0  USE TEMP B-TREE FOR GROUP BY


## Optimization: Adding Index on `offering_id`

**Business reasoning:**  
Most hotel-level analytics group or filter by `offering_id`. Indexing this column should:  

- Reduce I/O by allowing SQLite to locate groups more efficiently.  
- Shorten query times, especially as review volume grows over multiple years.

In [14]:
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_reviews_offering
ON reviews(offering_id)
""")
conn.commit()

In [15]:
start = time.time()
optimized_result = pd.read_sql(aggregation_query, conn)
end = time.time()

print("Aggregation Time (After Index):", round(end - start, 4), "seconds")

Aggregation Time (After Index): 0.112 seconds


In [16]:
print(pd.read_sql(explain_query, conn))

   id  parent  notused                                         detail
0   7       0      222  SCAN reviews USING INDEX idx_reviews_offering


### Interpretation: Impact of Indexing

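After adding the index, two things are worth noting:

- The query plan changes from `SCAN reviews` plus `USE TEMP B-TREE FOR GROUP BY` to `SCAN reviews USING INDEX idx_reviews_offering`. SQLite now reads rows in `offering_id` order through the index, so it no longer needs a temporary B-tree to form the groups.  
- Runtime is essentially unchanged in this environment (0.1077 s before, 0.112 s after). The query still visits every row, and because the index does not contain `service`, `cleanliness`, or `value`, each index entry requires an extra lookup into the table. The benefit of the index should be more visible for queries that filter on specific `offering_id` values, or on larger datasets where building the temporary B-tree becomes expensive.  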
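
One possible follow-up, which is not part of the original analysis, is a covering index that also stores the aggregated columns so the query can be answered from the index alone. The sketch below uses an in-memory database with made-up rows; the index name `idx_reviews_offering_cover` and the helper `plan` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reviews (offering_id INTEGER, service REAL, "
    "cleanliness REAL, value REAL)"
)
# Made-up rows, only so the table is not empty.
conn.executemany(
    "INSERT INTO reviews VALUES (?, ?, ?, ?)",
    [(i % 50, 4.0, 3.0, 5.0) for i in range(1000)],
)

query = """
SELECT offering_id, AVG(service), AVG(cleanliness), AVG(value)
FROM reviews
GROUP BY offering_id
"""


def plan(connection, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as a list of strings."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]


plan_no_index = plan(conn, query)

# Hypothetical covering index: grouping key first, then the aggregated columns.
conn.execute(
    "CREATE INDEX idx_reviews_offering_cover "
    "ON reviews(offering_id, service, cleanliness, value)"
)
plan_covering = plan(conn, query)

print(plan_no_index)
print(plan_covering)
conn.close()
```

If SQLite reports a covering index in the second plan, the table lookups disappear. Whether this yields a measurable speedup on the real data would need to be timed, and the wider index costs additional disk space and slower inserts.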


## Query-Level Profiling: Joining Reviews and Hotels

We first measure the join performance before any index on the `hotels` table.


In [17]:
join_query = """
SELECT r.offering_id,
       r.author_num_helpful_votes,
       a.offering_id
FROM reviews r
JOIN hotels a
ON r.offering_id = a.offering_id
"""

start = time.time()
baseline_join_result = pd.read_sql(join_query, conn)
end = time.time()

print("Join Time (Before Index): ", round(end - start, 4), "seconds")

Join Time (Before Index):  0.1552 seconds


In [18]:
conn.execute("""
CREATE INDEX IF NOT EXISTS idx_hotels_offering
ON hotels(offering_id)
""")
conn.commit()

In [19]:
join_query = """
SELECT r.offering_id,
       r.author_num_helpful_votes,
       a.offering_id
FROM reviews r
JOIN hotels a
ON r.offering_id = a.offering_id
"""

start = time.time()
optimized_join_result = pd.read_sql(join_query, conn)
end = time.time()

print("Join Time (After Index): ", round(end - start, 4), "seconds")

Join Time (After Index):  0.1365 seconds


## Code-Level Profiling: Pandas Aggregation

Beyond database queries, we also profile **Python/Pandas code** to identify hotspots in our in-memory analytics pipeline.

We define `pandas_aggregation()` to compute hotel-level averages (service, cleanliness, value, location_rating, sleep_quality, rooms) using `groupby().mean()` on the filtered review data. We then run `cProfile` and inspect the top functions by cumulative time. 

In [20]:
def pandas_aggregation():
    return filtered_reviews.groupby('offering_id')[
        ['service', 'cleanliness', 'value', 'location_rating', 'sleep_quality', 'rooms']
    ].mean()

pr = cProfile.Profile()
pr.enable()

pandas_aggregation()

pr.disable()

s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumtime')
ps.print_stats(10)

print(s.getvalue())

         2243 function calls (2208 primitive calls) in 0.017 seconds

   Ordered by: cumulative time
   List reduced from 426 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/2    0.000    0.000    0.012    0.006 /Users/aviralgoyal/Desktop/NUS/Courses/Sem 2/IS5126 Hands-On Applied Analytics/A1/.venv/lib/python3.13/site-packages/IPython/core/interactiveshell.py:3665(run_code)
      3/2    0.000    0.000    0.012    0.006 {built-in method builtins.exec}
        1    0.000    0.000    0.011    0.011 /var/folders/hd/1qp006zn48sdflzfg6_25rtr0000gn/T/ipykernel_61141/3244918794.py:1(<module>)
        1    0.000    0.000    0.011    0.011 /var/folders/hd/1qp006zn48sdflzfg6_25rtr0000gn/T/ipykernel_61141/3244918794.py:1(pandas_aggregation)
        1    0.000    0.000    0.011    0.011 /Users/aviralgoyal/Desktop/NUS/Courses/Sem 2/IS5126 Hands-On Applied Analytics/A1/.venv/lib/python3.13/site-packages/pandas/core/groupby/groupby.py:2192
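
The enable/disable/print boilerplate above is repeated for each profiled function. A minimal sketch of a reusable wrapper follows; the name `profile_call` and the stand-in workload are hypothetical, and `strip_dirs()` is used only to shorten the long file paths in the report.

```python
import cProfile
import io
import pstats


def profile_call(fn, top=10, sort_key="cumtime"):
    """Profile fn() and return (result, report text) for the top entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = fn()
    finally:
        profiler.disable()  # always stop profiling, even if fn() raises
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.strip_dirs()  # drop directory prefixes from file names
    stats.sort_stats(sort_key).print_stats(top)
    return result, stream.getvalue()


# Stand-in workload; in the notebook this could be pandas_aggregation.
result, report = profile_call(lambda: sorted(range(50_000), key=lambda x: -x))
print(report)
```

Sorting by `"tottime"` instead of `"cumtime"` highlights the functions that spend the most time in their own bodies rather than in their callees.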

## Preparing Hotel-Level Feature Profiles

To support **competitive benchmarking** and segmentation, we construct a hotel-level feature matrix: 

- We group by `offering_id` and compute mean values for service, cleanliness, value, location_rating, sleep_quality, and rooms.  
- We standardize these features using `StandardScaler` (zero mean, unit variance per feature) so that each rating dimension contributes comparably to the distance calculations used in clustering.  

**Business use case:**  
These standardized profiles allow us to cluster hotels into groups with similar performance patterns, which managers can use to:

- Identify “true peers” for benchmarking.  
- Compare underperforming hotels against high-performing ones in the same cluster to identify best practices.

In [21]:
hotel_profiles = filtered_reviews.groupby('offering_id')[['service', 'cleanliness', 'value',
                                            'location_rating', 'sleep_quality', 'rooms']].mean()

scaler = StandardScaler()
scaled_features = scaler.fit_transform(hotel_profiles)
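
To make the standardization step concrete, the sketch below reproduces the z-score calculation with NumPy on a small made-up matrix. The values are hypothetical; `ddof=0` (population standard deviation) is what `StandardScaler` uses by default.

```python
import numpy as np

# Hypothetical hotel-level mean ratings (rows = hotels, columns = aspects).
profiles = np.array([
    [4.5, 4.0, 3.5],
    [3.0, 3.5, 4.0],
    [4.0, 4.5, 2.5],
    [2.5, 3.0, 3.0],
])

# Z-score per column: subtract the column mean, divide by the column
# standard deviation (population form, ddof=0).
mean = profiles.mean(axis=0)
std = profiles.std(axis=0, ddof=0)
scaled = (profiles - mean) / std

print(scaled.round(3))
```

After scaling, every column has mean 0 and standard deviation 1, so no single rating aspect dominates the Euclidean distances that KMeans minimizes.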

In [22]:
def run_kmeans():
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
    kmeans.fit(scaled_features)

pr = cProfile.Profile()
pr.enable()

run_kmeans()

pr.disable()

s = io.StringIO()
ps = pstats.Stats(pr, stream=s).sort_stats('cumtime')
ps.print_stats(10)

print(s.getvalue())

         53699 function calls (53594 primitive calls) in 0.218 seconds

   Ordered by: cumulative time
   List reduced from 515 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      3/2    0.000    0.000    0.217    0.108 /Users/aviralgoyal/Desktop/NUS/Courses/Sem 2/IS5126 Hands-On Applied Analytics/A1/.venv/lib/python3.13/site-packages/IPython/core/interactiveshell.py:3665(run_code)
      3/2    0.000    0.000    0.217    0.108 {built-in method builtins.exec}
        1    0.000    0.000    0.217    0.217 /var/folders/hd/1qp006zn48sdflzfg6_25rtr0000gn/T/ipykernel_61141/866295218.py:1(<module>)
        1    0.000    0.000    0.216    0.216 /var/folders/hd/1qp006zn48sdflzfg6_25rtr0000gn/T/ipykernel_61141/866295218.py:1(run_kmeans)
        1    0.000    0.000    0.216    0.216 /Users/aviralgoyal/Desktop/NUS/Courses/Sem 2/IS5126 Hands-On Applied Analytics/A1/.venv/lib/python3.13/site-packages/sklearn/base.py:1319(wrapper)
        3    

### Interpretation: Clustering Performance

The profiling results indicate that the majority of time is spent inside scikit-learn’s KMeans routines and parallel utilities.

Key points:

- Runtime (~0.2 seconds) is acceptable for periodic recomputation (e.g., daily or weekly).  
- For interactive re-clustering with many parameter changes, we might cache results or reduce feature dimensionality.  

From a business point of view, clustering is **fast enough** to be a regular part of the analytics pipeline.

### Memory Footprint: Cleaned Full Dataset

The memory usage printed below (461.47 MB) quantifies the RAM needed to hold the cleaned review dataset in memory.

**Business implication:**  
Running heavy analytics directly on this full dataset may be challenging on resource-constrained servers, especially if multiple users or notebooks run concurrently. This motivates our filtering and database offloading strategies.


In [23]:
memory_mb = cleaned_review_df.memory_usage(deep=True).sum() / 1024**2
print("Memory Usage (Cleaned Full Dataset):", round(memory_mb, 2), "MB")

Memory Usage (Cleaned Full Dataset): 461.47 MB


### Memory Footprint: High-Helpfulness Filtered Dataset

After applying the 80th percentile helpfulness filter, memory usage drops from 461.47 MB to 121.16 MB, roughly one-quarter (about 26%) of the cleaned dataset.

**Business impact:**  

- Lower memory usage means **cheaper infrastructure** (smaller instances) or more capacity for concurrent users.  
- It also reduces the risk of slowdowns or crashes during heavy computations, improving reliability of the analytics platform for hotel managers. 


In [24]:
memory_mb = filtered_reviews.memory_usage(deep=True).sum() / 1024**2
print("Memory Usage (Filtered Dataset):", round(memory_mb, 2), "MB")

Memory Usage (Filtered Dataset): 121.16 MB
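
Beyond row filtering, narrower column dtypes are another possible way to reduce the footprint. This is an option the notebook does not evaluate. The sketch below uses a made-up stand-in table and assumes ratings lie on a small numeric scale and that ids fit in 32 bits; both assumptions would need to be checked against the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few numeric columns of the review table.
rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "offering_id": rng.integers(1, 4000, size=10_000),
    "overall": rng.integers(1, 6, size=10_000).astype("float64"),
    "service": rng.integers(1, 6, size=10_000).astype("float64"),
})

before_mb = ratings.memory_usage(deep=True).sum() / 1024**2

# Assumption: small rating values are exact in float32, ids fit in int32.
compact = ratings.astype({
    "offering_id": "int32",
    "overall": "float32",
    "service": "float32",
})
after_mb = compact.memory_usage(deep=True).sum() / 1024**2

print(f"before={before_mb:.3f} MB after={after_mb:.3f} MB")
```

On the real table most of the memory is likely held by the text columns, so the saving from numeric downcasting alone may be modest.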
