
# Hotel Review Analytics – Data Preparation & Business Context

## Business Objective
This notebook prepares hotel review data for analytics and machine learning.

From a business perspective, the objective is to:

- Clean and standardize customer review data  
- Aggregate hotel-level performance metrics  
- Enable comparison of service quality  
- Prepare structured input for clustering analysis  

This supports strategic decisions such as identifying high-performing hotels, 
improving underperforming properties, and enhancing customer satisfaction.


In [1]:
import json
import sqlite3
import pandas as pd

In [2]:
JSON_FILE = "data/review.json"        
SQLITE_DB = "data/reviews_sample.db"

In [3]:
records = []

with open(JSON_FILE, "r", encoding="utf-8") as f:
    for line in f:
        records.append(json.loads(line))

df = pd.DataFrame(records)
print(f"Loaded {len(df)} raw records")

Loaded 878561 raw records


In [4]:
df.head()

Unnamed: 0,ratings,title,text,author,date_stayed,offering_id,num_helpful_votes,date,id,via_mobile
0,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...","“Truly is ""Jewel of the Upper Wets Side""”",Stayed in a king suite for 11 nights and yes i...,"{'username': 'Papa_Panda', 'num_cities': 22, '...",December 2012,93338,0,"December 17, 2012",147643103,False
1,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",“My home away from home!”,"On every visit to NYC, the Hotel Beacon is the...","{'username': 'Maureen V', 'num_reviews': 2, 'n...",December 2012,93338,0,"December 17, 2012",147639004,False
2,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",“Great Stay”,This is a great property in Midtown. We two di...,"{'username': 'vuguru', 'num_cities': 12, 'num_...",December 2012,1762573,0,"December 18, 2012",147697954,False
3,"{'service': 5.0, 'cleanliness': 5.0, 'overall'...",“Modern Convenience”,The Andaz is a nice hotel in a central locatio...,"{'username': 'Hotel-Designer', 'num_cities': 5...",August 2012,1762573,0,"December 17, 2012",147625723,False
4,"{'service': 4.0, 'cleanliness': 5.0, 'overall'...",“Its the best of the Andaz Brand in the US....”,I have stayed at each of the US Andaz properti...,"{'username': 'JamesE339', 'num_cities': 34, 'n...",December 2012,1762573,0,"December 17, 2012",147612823,False


In [6]:
df['review_date'] = pd.to_datetime(df['date'], errors='coerce')
max_date = df['review_date'].max()
cutoff_date = max_date - pd.DateOffset(years=5)
df = df[df["review_date"] >= cutoff_date]

print(f"After date filtering: {len(df)} reviews")

After date filtering: 754798 reviews


In [7]:
df['author_id'] = df['author'].apply(lambda x: x.get('id'))
df['author_name'] = df['author'].apply(lambda x: x.get('username'))
df['author_location'] = df['author'].apply(lambda x: x.get('location'))
df['author_num_reviews'] = df["author"].apply(lambda x: x.get("num_reviews"))
df['author_num_cities'] = df["author"].apply(lambda x: x.get("num_cities"))
df['author_num_helpful_votes'] = df["author"].apply(lambda x: x.get("num_helpful_votes"))
df['author_num_type_reviews'] = df["author"].apply(lambda x: x.get("num_type_reviews"))

In [8]:
df["overall"] = df["ratings"].apply(lambda x: x.get("overall"))
df["service"] = df["ratings"].apply(lambda x: x.get("service"))
df["cleanliness"] = df["ratings"].apply(lambda x: x.get("cleanliness"))
df["value"] = df["ratings"].apply(lambda x: x.get("value"))
df["location_rating"] = df["ratings"].apply(lambda x: x.get("location"))
df["sleep_quality"] = df["ratings"].apply(lambda x: x.get("sleep_quality"))
df["rooms"] = df["ratings"].apply(lambda x: x.get("rooms"))
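
The per-key `apply` calls above can also be written as a single expansion. The following is a minimal sketch of that alternative using `pd.json_normalize`, not the method this notebook uses. The `toy` frame and its values are made up for illustration; only the `ratings` column name and the `location` to `location_rating` rename come from the cells above.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `df`, with a non-default index
# to mimic a frame that has already been filtered.
toy = pd.DataFrame(
    {
        "id": [1, 2],
        "ratings": [
            {"overall": 5.0, "service": 4.0, "location": 5.0},
            {"overall": 3.0, "cleanliness": 2.0},
        ],
    },
    index=[10, 20],
)

# Expand the dicts into one column per key; keys missing from a dict become NaN.
ratings_wide = pd.json_normalize(toy["ratings"].tolist())

# json_normalize returns a fresh 0..n-1 index, so realign it with the source rows.
ratings_wide.index = toy.index

# Match the notebook's column name for the location rating.
ratings_wide = ratings_wide.rename(columns={"location": "location_rating"})

flat = toy.drop(columns="ratings").join(ratings_wide)
print(flat)
```

One pass over the column replaces seven, and sub-ratings that a review does not contain show up as `NaN`, the same as with `x.get(...)`.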

In [9]:
author_df = (df[[
    'author_id',
    'author_name',
    'author_location',
    'author_num_reviews',
    'author_num_cities',
    'author_num_helpful_votes',
    'author_num_type_reviews'
]].drop_duplicates())

In [10]:
hotels_df = (df[[
    'offering_id'
]].drop_duplicates())

In [11]:
reviews_df = (df[[
    'id',
    'author_id',
    'offering_id',
    'overall',
    'service',
    'cleanliness',
    'value',
    'location_rating',
    'sleep_quality',
    'rooms',
    'title',
    'text',
    'review_date',
    'date_stayed',
    'via_mobile',
    'author_num_helpful_votes'
]].drop_duplicates())


### Data Cleaning

We drop every row that has a missing value in any column, then drop reviews whose text duplicates that of an earlier review.

In [12]:
cleaned_review_df = reviews_df.dropna().drop_duplicates(subset=['text']).copy()

print("Total Number of records in Cleaned Dataset after removing blank rows and duplicate reviews: ", cleaned_review_df.shape)

Total Number of records in Cleaned Dataset after removing blank rows and duplicate reviews:  (343758, 16)
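
Because `dropna()` removes a row when any one of its columns is missing, a single sparsely filled column can account for most of the reduction in row count. The sketch below shows one way to check which columns drive the loss before dropping anything. `missing_report` and the toy data are hypothetical helpers for illustration and are not part of the notebook.

```python
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and shares, largest first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({"missing": counts, "share": counts / len(frame)})
    return report.sort_values("missing", ascending=False)


# Hypothetical miniature of `reviews_df`: one sub-rating is mostly empty.
toy_reviews = pd.DataFrame(
    {
        "id": [1, 2, 3, 4],
        "overall": [5.0, 4.0, 3.0, 5.0],
        "sleep_quality": [None, 4.0, None, None],
        "text": ["great", "ok", "ok", "fine"],
    }
)

report = missing_report(toy_reviews)
print(report)

# Same two steps as the cleaning cell, kept separate to see each step's effect.
after_dropna = toy_reviews.dropna()
after_dedup = after_dropna.drop_duplicates(subset=["text"])
print(len(toy_reviews), len(after_dropna), len(after_dedup))
```

Running `missing_report(reviews_df)` on the real data would show whether restricting `dropna` to a subset of columns is worth considering.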



### Percentile Filter

We identify reviews whose authors have received a high number of helpful votes (`author_num_helpful_votes`), which indicates reviewers that other users trust and find useful. We experiment with percentiles from 50 to 95 in steps of 5. For each percentile we compute the helpful-vote threshold and count the reviews at or above it. We then choose the percentile by trading off threshold against data volume: the threshold should be as high as possible while still keeping a good volume of data (50,000 to 80,000 reviews).

In [13]:
for q in range(50, 100, 5):
    # Helpful-vote count at the q-th percentile of the cleaned reviews
    threshold = cleaned_review_df['author_num_helpful_votes'].quantile(q / 100)
    # Number of reviews at or above that threshold
    n_rows = (cleaned_review_df['author_num_helpful_votes'] >= threshold).sum()
    print(f"Percentile: {q}, threshold: {threshold},", "num of rows: ", n_rows)

Percentile: 50, threshold: 7.0, num of rows:  181736
Percentile: 55, threshold: 9.0, num of rows:  156102
Percentile: 60, threshold: 10.0, num of rows:  144946
Percentile: 65, threshold: 12.0, num of rows:  126417
Percentile: 70, threshold: 15.0, num of rows:  104527
Percentile: 75, threshold: 18.0, num of rows:  87608
Percentile: 80, threshold: 22.0, num of rows:  71102
Percentile: 85, threshold: 28.0, num of rows:  53165
Percentile: 90, threshold: 38.0, num of rows:  35220
Percentile: 95, threshold: 61.0, num of rows:  17334
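
In the output above, the 80th and 85th percentiles are the ones that leave between 50,000 and 80,000 rows. The notebook does not show which value was finally chosen, so the sketch below takes the percentile as a parameter. `filter_by_helpful_votes` is a hypothetical helper, and the toy frame and the value 80 are used only for illustration.

```python
import pandas as pd


def filter_by_helpful_votes(frame, percentile, column="author_num_helpful_votes"):
    """Keep rows whose `column` value is at or above the given percentile (0-100)."""
    threshold = frame[column].quantile(percentile / 100)
    kept = frame[frame[column] >= threshold].copy()
    return kept, threshold


# Hypothetical data standing in for `cleaned_review_df`.
toy = pd.DataFrame(
    {
        "id": range(1, 11),
        "author_num_helpful_votes": [0, 1, 2, 3, 5, 8, 13, 21, 34, 55],
    }
)

# 80 is an assumed choice, not a value confirmed by the notebook.
trusted, threshold = filter_by_helpful_votes(toy, percentile=80)
print(f"threshold={threshold}, rows kept={len(trusted)}")
```

Returning the threshold alongside the filtered frame makes it easy to record the cut-off that produced a given dataset.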


## Database Creation

The SQL script below creates the `authors`, `hotels`, and `reviews` tables in the SQLite review database. The DataFrames are then written into these tables.

In [19]:
conn = sqlite3.connect(SQLITE_DB)
cursor = conn.cursor()

In [15]:
sql = """
DROP TABLE IF EXISTS authors;
DROP TABLE IF EXISTS hotels;
DROP TABLE IF EXISTS reviews;

CREATE TABLE authors (
author_no INTEGER PRIMARY KEY AUTOINCREMENT,
author_id TEXT,
author_name TEXT,
author_location TEXT,
author_num_reviews INTEGER,
author_num_cities INTEGER,
author_num_helpful_votes INTEGER,
author_num_type_reviews INTEGER
);

CREATE TABLE hotels (
offering_id INTEGER PRIMARY KEY
);

CREATE TABLE reviews (
id INTEGER PRIMARY KEY,
author_no INTEGER,
author_id TEXT,
offering_id INTEGER,
overall REAL,
service REAL,
cleanliness REAL,
value REAL,
location_rating REAL,
sleep_quality REAL,
rooms REAL,
title TEXT,
text TEXT,
review_date DATE,
date_stayed TEXT,
via_mobile BOOLEAN,
author_num_helpful_votes INTEGER,

FOREIGN KEY(author_no) REFERENCES authors(author_no),
FOREIGN KEY(offering_id) REFERENCES hotels(offering_id)
);
"""

cursor.executescript(sql)
conn.commit()

In [16]:
author_df.to_sql("authors", conn, if_exists="append", index=False)
hotels_df.to_sql("hotels", conn, if_exists="append", index=False)
reviews_df.to_sql("reviews", conn, if_exists="append", index=False)

print("Data successfully stored in SQLite")

Data successfully stored in SQLite
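
`reviews_df` carries `author_id` but no `author_no`, so the `author_no` column of `reviews` appears to stay NULL after the append above. The sketch below shows one way to back-fill it from `authors`. It runs on a small in-memory database with made-up rows, and it assumes the surrogate key column is named `author_no`, as the foreign-key clause expects. `MIN` is an arbitrary tie-break in case one `author_id` occurs in several `authors` rows.

```python
import sqlite3

# In-memory database with a reduced, hypothetical version of the schema.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (
        author_no INTEGER PRIMARY KEY AUTOINCREMENT,
        author_id TEXT
    );
    CREATE TABLE reviews (
        id INTEGER PRIMARY KEY,
        author_no INTEGER,
        author_id TEXT,
        FOREIGN KEY(author_no) REFERENCES authors(author_no)
    );
    INSERT INTO authors (author_id) VALUES ('A1'), ('B2');
    INSERT INTO reviews (id, author_id) VALUES (10, 'B2'), (11, 'A1'), (12, 'B2');
    """
)

# Copy the surrogate key into reviews by matching on the natural key.
conn.execute(
    """
    UPDATE reviews
    SET author_no = (
        SELECT MIN(a.author_no)
        FROM authors AS a
        WHERE a.author_id = reviews.author_id
    )
    """
)
conn.commit()

rows = conn.execute("SELECT id, author_no FROM reviews ORDER BY id").fetchall()
print(rows)
conn.close()
```

Without a step like this, an index on `reviews(author_no)` has nothing useful to look up, and reviewer-level queries would have to join on `author_id` instead.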


## Index Creation

To support fast queries on the review database, we create indexes on the main access paths:

- `idx_reviews_date` on `reviews(review_date)` to speed up time-based analyses such as “last 12 months” or “last 5 years” trends.
- `idx_reviews_user` on `reviews(author_no)` to accelerate queries that analyze reviewer behaviour and helpfulness (e.g., top reviewers, reviewer reliability).
- `idx_reviews_hotel` on `reviews(offering_id)` to make hotel-level aggregations and dashboards (e.g., average ratings per hotel, hotel comparisons) more responsive.

These indexes are chosen based on the key business queries managers will run most often, improving performance without changing query logic.


In [17]:
index_sql = """
CREATE INDEX idx_reviews_date ON reviews(review_date);
CREATE INDEX idx_reviews_user ON reviews(author_no);
CREATE INDEX idx_reviews_hotel ON reviews(offering_id);
"""

cursor.executescript(index_sql)
conn.commit()

conn.close()
print("Indexes created and database finalized")

Indexes created and database finalized
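
To check that a query actually uses one of these indexes, SQLite's `EXPLAIN QUERY PLAN` can be run on it. The sketch below does this on a small in-memory table with made-up rows. Only the index name `idx_reviews_hotel` and the column names come from the notebook, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE reviews (
        id INTEGER PRIMARY KEY,
        offering_id INTEGER,
        overall REAL,
        review_date DATE
    );
    CREATE INDEX idx_reviews_hotel ON reviews(offering_id);
    """
)
conn.executemany(
    "INSERT INTO reviews (offering_id, overall, review_date) VALUES (?, ?, ?)",
    [
        (93338, 5.0, "2012-12-17"),
        (1762573, 4.0, "2012-12-18"),
        (1762573, 5.0, "2012-12-17"),
    ],
)

query = "SELECT AVG(overall) FROM reviews WHERE offering_id = ?"

# Each plan row ends with a human-readable description of one step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (1762573,)).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)

avg_overall = conn.execute(query, (1762573,)).fetchone()[0]
print(avg_overall)
conn.close()
```

If the plan text mentions a scan of `reviews` instead of the index name, the query is not benefiting from the index.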


In [18]:
author_df.to_csv("data/authors.csv", index = False)
reviews_df.to_csv("data/reviews.csv", index = False)
hotels_df.to_csv("data/hotels.csv", index = False)