## Combining Datasets

Datasets we have
1. Fitness First
   1. Junction 10 (North) 4.5 rating, 284 reviews
   2. One George Street (Central/South) 4.4 rating, 283 reviews
   3. Tampines (East) 4.1 rating, 198 reviews
   4. Paya Lebar (East) 3.9 rating, 197 reviews
2. Virgin Active
   1. Tanjong Pagar (Central/South) 4.3 rating, 438 reviews
   2. Paya Lebar (East) 4.3 rating, 219 reviews

The dataframe outputs we want are
1. virgin_active_reviews
   1. virgin_active_positive_reviews
   2. virgin_active_negative_reviews
2. fitness_first_reviews
   1. fitness_first_positive_reviews
   2. fitness_first_negative_reviews
3. competitor_reviews

The other notebooks will analyze competitor gym reviews and ratings for better decision-making using the following NLP techniques
1.  Word Clouds by Rating: Compare word clouds for high (4-5 stars) and low (1-2 stars) ratings to identify key strengths and weaknesses.
2.  Topic Modeling (LDA): Discover hidden themes (e.g., equipment, classes, staff) to understand common concerns and strengths.
3.  Sentiment by Topic: Analyze sentiment for specific topics (e.g., "equipment" or "classes") across different rating levels to pinpoint areas for improvement.

In [96]:
import pandas as pd
import plotly.express as px

##### Combine the Datasets

In [97]:
ff_junction = pd.read_csv('datasets/fitness-first-georgest.csv')
ff_george = pd.read_csv('datasets/fitness-first-junction10.csv')
ff_paya = pd.read_csv('datasets/fitness-first-paya.csv')
va_tanjong = pd.read_csv('datasets/virgin-active-tanjong.csv')
va_paya = pd.read_csv('datasets/virgin-active-paya.csv')

In [98]:
# We will use use pd.concat() to combine the dataframes row-wise
fitness_first_reviews = pd.concat([ff_junction, ff_george, ff_paya], axis=0, ignore_index=True)
virgin_active_reviews = pd.concat([va_paya, va_tanjong], axis=0, ignore_index=True)

##### Clean the Data and Feature Engineering (Updated)
1. Select the relevant columns: Keep only ['review_text', 'rating', 'published_at_date', 'place_name'].
2. Rename the 'review_text' column to review.
3. Add a new column 'length' that shows the length of each review.
4. Separate the 'place_name' column into two new columns: gym and branch.
5. Create a count plot to show the distribution of rating categories by year.
6. Keep the necessary columns: ['review', 'rating', 'gym', 'branch', 'length', 'year'].
7. Remove rows where the review column has null values.
8. Export the final DataFrame to CSV files: 'fitness_first_reviews.csv' and 'virgin_active_reviews.csv'.

## Fitness First

In [99]:
# 1. Select the relevant columns: Keep only ['review_text', 'rating', 'published_at_date', 'place_name'].
fitness_first_reviews = fitness_first_reviews[['review_text','rating','published_at_date','place_name']]

# 2. Rename the 'review_text' column to review.
fitness_first_reviews.rename(columns={'review_text': 'review'}, inplace=True)

# 3. Add a new column that shows the length of each review.
fitness_first_reviews['length'] = fitness_first_reviews['review'].astype(str).apply(len)

fitness_first_reviews.head(2)

Unnamed: 0,review,rating,published_at_date,place_name,length
0,Best gym so far,5,2024-12-17T06:47:30,Fitness First - Tampines (CPF building),15
1,Khalifa n Khai all the front deck so friendly ...,5,2024-12-17T06:47:30,Fitness First - Tampines (CPF building),56


In [100]:
# 4. Separate the 'place_name' column into two new columns: gym and branch.
fitness_first_reviews['branch'] = fitness_first_reviews['place_name'].apply(lambda x: 'Tampines' if 'Tampines' in x else 'Paya Lebar')
fitness_first_reviews['gym'] = fitness_first_reviews['place_name'].apply(lambda x: 'Fitness First' if 'Fitness First' in x else '')
fitness_first_reviews.head(2)

Unnamed: 0,review,rating,published_at_date,place_name,length,branch,gym
0,Best gym so far,5,2024-12-17T06:47:30,Fitness First - Tampines (CPF building),15,Tampines,Fitness First
1,Khalifa n Khai all the front deck so friendly ...,5,2024-12-17T06:47:30,Fitness First - Tampines (CPF building),56,Tampines,Fitness First


In [101]:
# 5. Create a count plot to show the distribution of rating categories by year.

# Convert 'published_at_date' to datetime
fitness_first_reviews['published_at_date'] = pd.to_datetime(fitness_first_reviews['published_at_date'])

# Extract year from 'published_at_date'
fitness_first_reviews['year'] = fitness_first_reviews['published_at_date'].dt.year

# Create a count plot for rating categories by year
fig = px.histogram(fitness_first_reviews, x='year', color='rating', 
                   title='Fitness First: Count of Rating Categories by Year',
                   labels={'year': 'Year', 'rating': 'Rating Category'},
                   category_orders={'rating': [1, 2, 3, 4, 5]})

# Show the plot
fig.show()

In [102]:
# 6. Keep the necessary columns: ['review', 'rating', 'gym', 'branch', 'length', 'year'].
final_fitness_first_reviews = fitness_first_reviews[['review','rating','gym','branch','length','year']]
final_fitness_first_reviews

Unnamed: 0,review,rating,gym,branch,length,year
0,Best gym so far,5,Fitness First,Tampines,15,2024
1,Khalifa n Khai all the front deck so friendly ...,5,Fitness First,Tampines,56,2024
2,"The staff, Liyana and Khalid are friendly, out...",5,Fitness First,Tampines,76,2024
3,Have been working out in tampines ff for a whi...,5,Fitness First,Tampines,107,2024
4,"Great staff, Well maintained equipment. Always...",5,Fitness First,Tampines,64,2024
...,...,...,...,...,...,...
674,Biggest disappointment of a gym. Only one squa...,1,Fitness First,Paya Lebar,386,2018
675,,5,Fitness First,Paya Lebar,3,2018
676,,5,Fitness First,Paya Lebar,3,2018
677,,1,Fitness First,Paya Lebar,3,2018


In [103]:
# 7. Remove rows where the review column has null values.
final_fitness_first_reviews = final_fitness_first_reviews.dropna(subset=['review'])
final_fitness_first_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 441 entries, 0 to 678
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   review  441 non-null    object
 1   rating  441 non-null    int64 
 2   gym     441 non-null    object
 3   branch  441 non-null    object
 4   length  441 non-null    int64 
 5   year    441 non-null    int32 
dtypes: int32(1), int64(2), object(3)
memory usage: 22.4+ KB


In [104]:
# 8. Export the final DataFrame to CSV files: 'fitness_first_reviews.csv' and 'virgin_active_reviews.csv'.
final_fitness_first_reviews.to_csv('filtered-datasets/fitness_first_reviews.csv', index=False)