# Yelp Dataset Exploration

   This Jupyter notebook is designed to explore the Yelp Dataset. It involves loading several JSON datasets into pandas DataFrames, preprocessing, merging, and exporting them for further analysis. Our goal is to understand the characteristics of businesses, users, reviews, and photos within a specific geographical location and domain (restaurants).


In [1]:
import pandas as pd
import numpy as np


## Loading Datasets

The first step in our exploration is to load various datasets provided by Yelp. These datasets include information on businesses, users, reviews, and photos. We use the `pd.read_json` function to load each dataset in chunks for efficient memory management.


In [2]:


# Define the path to your JSON file
file_path = "./dataset/jsons/yelp_academic_dataset_business.json"

# Directly concatenate the chunks read from the JSON file into a single DataFrame
business = pd.concat(
    pd.read_json(file_path, lines=True, chunksize=10000),
    ignore_index=True
)

# 'business' DataFrame now contains all the data from the JSON file


In [3]:


# Define the path to your JSON file
file_path = "./dataset/jsons/yelp_academic_dataset_user.json"

# Directly concatenate the chunks read from the JSON file into a single DataFrame
user = pd.concat(
    pd.read_json(file_path, lines=True, chunksize=10000),
    ignore_index=True
)

# The 'user' DataFrame now contains all the data from the JSON file


In [4]:


# Define the path to your JSON file
file_path = "./dataset/jsons/yelp_academic_dataset_review.json"

# Read the JSON file in chunks and concatenate directly without intermediate list comprehension
review = pd.concat(
    pd.read_json(file_path, lines=True,
                 dtype={'review_id': str, 'user_id': str,
                        'business_id': str, 'stars': 'int8',
                        'date': str, 'text': str, 'useful': 'int8',
                        'funny': 'int8', 'cool': 'int8'},
                 chunksize=10000),
    ignore_index=True
)

# Now 'review' contains the entire dataset


In [5]:


# Define the path to your JSON file
file_path = "./dataset/jsons/photos.json"

# Directly concatenate the chunks read from the JSON file into a single DataFrame
photo = pd.concat(
    pd.read_json(file_path, lines=True, chunksize=10000),
    ignore_index=True
)

# The 'photo' DataFrame now contains all the data from the JSON file

## Data Preprocessing

After loading the data, we focus on preprocessing. This includes filtering datasets based on specific criteria (like selecting only Nevada-based businesses), merging datasets, and calculating new metrics (such as an adjusted score for businesses based on their reviews).


In [6]:
# Group by 'business_id' and take the first 'photo_id' from each group
first_photo = photo.groupby('business_id', as_index=False).first()[['business_id', 'photo_id']]


In [7]:
list_of_states = ['NV']
business = business[business.state.isin(list_of_states)]

In [8]:
import pandas as pd
import numpy as np

def filter_restaurant_businesses(df):
    restaurant_set = set(['restaurants', 'fast food', 'sandwiches', 'caterers', 'deserts', 'burgers'])
    
    # Function to determine if a business is a restaurant
    def is_restaurant(categories):
        if pd.isna(categories):
            return False
        categories_set = set(map(str.strip, map(str.lower, categories.split(','))))
        if categories_set.intersection(restaurant_set):
            return True
        return False  # Assuming businesses not explicitly identified as restaurants are not restaurants
    
    mask = df['categories'].apply(is_restaurant)
    
    return df[mask]

business = filter_restaurant_businesses(business)


In [9]:
import pandas as pd
import numpy as np

# Assuming 'business' is your DataFrame and already defined.
# Perform the merge operation and assign the result back to the original DataFrame variable
business = pd.merge(business, first_photo, on="business_id", how="left")

# Fill NaN values in the 'photo_id' column with 'no-image'
business['photo_id'] = business['photo_id'].fillna('no-image')


# Calculate the global mean rating across all businesses
global_mean = (business['stars'] * business['review_count']).sum() / business['review_count'].sum()
# Determine the median review count (50th percentile) dynamically
k = business['review_count'].quantile(0.5)
# Calculate the adjusted score for each business
business['adjusted_score'] = ((business['review_count'] * business['stars']) + (k * global_mean)) / (business['review_count'] + k)
# Sorting the 'business' DataFrame by 'adjusted_score' in descending order, in place
business.sort_values(by='adjusted_score', ascending=False, inplace=True)
# reset index
business.reset_index(drop=True, inplace=True)
print(business.info())



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1705 entries, 0 to 1704
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   business_id     1705 non-null   object 
 1   name            1705 non-null   object 
 2   address         1705 non-null   object 
 3   city            1705 non-null   object 
 4   state           1705 non-null   object 
 5   postal_code     1705 non-null   object 
 6   latitude        1705 non-null   float64
 7   longitude       1705 non-null   float64
 8   stars           1705 non-null   float64
 9   review_count    1705 non-null   int64  
 10  is_open         1705 non-null   int64  
 11  attributes      1687 non-null   object 
 12  categories      1705 non-null   object 
 13  hours           1450 non-null   object 
 14  photo_id        1705 non-null   object 
 15  adjusted_score  1705 non-null   float64
dtypes: float64(4), int64(2), object(10)
memory usage: 213.2+ KB
None


In [10]:

user = user[(user['review_count'] > 0) & (user['average_stars'] != 0)].reset_index(drop=True)
print(user.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1987843 entries, 0 to 1987842
Data columns (total 22 columns):
 #   Column              Dtype  
---  ------              -----  
 0   user_id             object 
 1   name                object 
 2   review_count        int64  
 3   yelping_since       object 
 4   useful              int64  
 5   funny               int64  
 6   cool                int64  
 7   elite               object 
 8   friends             object 
 9   fans                int64  
 10  average_stars       float64
 11  compliment_hot      int64  
 12  compliment_more     int64  
 13  compliment_profile  int64  
 14  compliment_cute     int64  
 15  compliment_list     int64  
 16  compliment_note     int64  
 17  compliment_plain    int64  
 18  compliment_cool     int64  
 19  compliment_funny    int64  
 20  compliment_writer   int64  
 21  compliment_photos   int64  
dtypes: float64(1), int64(16), object(5)
memory usage: 333.7+ MB
None


In [11]:

unique_user_ids = set(user.user_id.unique())
unique_business_ids = set(business.business_id.unique())

review = review[(review.user_id.isin(unique_user_ids)) & 
                (review.business_id.isin(unique_business_ids)) & 
                (review.stars != 0)]

review.reset_index(drop=True, inplace=True)
print(review.info())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244323 entries, 0 to 244322
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   review_id    244323 non-null  object
 1   user_id      244323 non-null  object
 2   business_id  244323 non-null  object
 3   stars        244323 non-null  int8  
 4   useful       244323 non-null  int8  
 5   funny        244323 non-null  int8  
 6   cool         244323 non-null  int8  
 7   text         244323 non-null  object
 8   date         244323 non-null  object
dtypes: int8(4), object(5)
memory usage: 10.3+ MB
None


In [12]:
photo = photo[photo.business_id.isin(unique_business_ids)]
photo.reset_index(drop=True, inplace=True)
print(photo.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6901 entries, 0 to 6900
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   photo_id     6901 non-null   object
 1   business_id  6901 non-null   object
 2   caption      6901 non-null   object
 3   label        6901 non-null   object
dtypes: object(4)
memory usage: 215.8+ KB
None


## Exporting Data

With the data now preprocessed, the final step involves exporting the modified DataFrames to both CSV and pickle formats for easy access in future analyses.


In [13]:
business.to_csv(path_or_buf='business.csv',index=False)
user.to_csv(path_or_buf='user.csv',index=False)
review.to_csv(path_or_buf='review.csv',index=False)
photo.to_csv(path_or_buf='photo.csv',index=False)

business.to_pickle('business.pkl')
user.to_pickle('user.pkl')
review.to_pickle('review.pkl')
photo.to_pickle('photo.pkl')

review[['user_id', 'business_id', 'stars', 'date']].to_pickle('review_cache.pkl')
