# Yelp Dataset Exploration

In this notebook, we will explore the Yelp Dataset, which can be downloaded from Kaggle: https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset?select=yelp_academic_dataset_review.json

This dataset is a subset of Yelp's businesses, reviews, and user data, which has been made available for personal, educational, and academic purposes.

We will start by loading the data and inspecting its structure and content. This will help us understand the data better and plan our analysis accordingly.

In [2]:
import pandas as pd

# Load the data.
# Either download the full dataset from the link above or use the reduced
# dataset provided in the "data" folder (uncomment the second path).
path = 'data/yelp_academic_dataset_review.json'
# path = 'data/yelp_academic_dataset_review_reduced_5k.json'

data = pd.read_json(path, lines=True)

# Display the first few rows of the dataframe
data.head()
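The full review file is large, so reading it in a single call may exhaust memory on some machines. The sketch below shows one possible alternative: reading the JSON Lines file in chunks and optionally keeping only selected columns. The helper name `load_reviews_in_chunks` and its default `chunksize` are assumptions made for illustration, not part of the original notebook.

```python
import pandas as pd


def load_reviews_in_chunks(path, columns=None, chunksize=100_000):
    """Read a JSON Lines file piece by piece and return one DataFrame.

    `columns` (optional) keeps only the named columns from each chunk,
    which lowers peak memory use. Name and defaults are illustrative.
    """
    chunks = []
    # With lines=True and a chunksize, read_json yields DataFrames one chunk at a time.
    for chunk in pd.read_json(path, lines=True, chunksize=chunksize):
        if columns is not None:
            chunk = chunk[columns]
        chunks.append(chunk)
    # Stitch the chunks together and renumber the rows from 0.
    return pd.concat(chunks, ignore_index=True)
```

For example, `data = load_reviews_in_chunks(path, columns=['text', 'stars'])` would load only the two columns, at the cost of dropping the others from later cells.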

The dataset contains the following columns:

- `review_id`: A unique identifier for the review.
- `user_id`: The ID of the user who wrote the review.
- `business_id`: The ID of the business that the review is about.
- `stars`: The star rating given by the user, from 1 to 5.
- `useful`: The number of 'useful' votes received by the review.
- `funny`: The number of 'funny' votes received by the review.
- `cool`: The number of 'cool' votes received by the review.
- `text`: The text of the review.
- `date`: The date when the review was posted.

Next, let's check the size of the dataset and see if there are any missing values.

In [None]:
# Check the size of the dataset
print('Number of rows:', data.shape[0])
print('Number of columns:', data.shape[1])

# Check for missing values
data.isnull().sum()

The dataset contains 6,990,280 rows and 9 columns. There are no missing values, so we do not need to impute or drop any rows.
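Note that `isnull()` only counts missing entries; it does not flag blank strings, duplicated IDs, or ratings outside the documented range. A minimal sketch of a few additional checks, assuming the column names listed above (the helper `extra_quality_checks` is hypothetical):

```python
import pandas as pd


def extra_quality_checks(df):
    """Count problems that isnull() does not report (illustrative helper)."""
    return {
        # Reviews whose text is empty or contains only whitespace.
        "blank_text": int(df["text"].str.strip().eq("").sum()),
        # Rows whose review ID already appeared in an earlier row.
        "duplicate_review_id": int(df["review_id"].duplicated().sum()),
        # Ratings outside the documented 1-to-5 range (bounds inclusive).
        "stars_out_of_range": int((~df["stars"].between(1, 5)).sum()),
    }
```

Calling `extra_quality_checks(data)` returns a small dictionary of counts; all zeros would support the conclusion that no cleaning is required.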

Next, let's get some basic statistics about the numerical columns in the dataset.

In [None]:
# Get basic statistics about the numerical columns
data.describe()

From the basic statistics, we can observe that:

- The average star rating is around 3.75, with a standard deviation of 1.48. On a 1-to-5 scale this is a wide spread, so ratings are far from concentrated at the mean.
- The 'useful', 'funny', and 'cool' columns have a lot of 0s, as indicated by their 25th, 50th (median), and 75th percentiles. This means that many reviews did not receive any votes in these categories.
- The maximum number of 'useful' votes received by a single review is 1182, while the maxima for 'funny' and 'cool' votes are 792 and 404, respectively.
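The percentiles only hint at how common zero-vote reviews are. One way to quantify it directly is to compute the fraction of reviews with exactly zero votes in each column. This is a sketch; the helper name `zero_vote_share` is an assumption.

```python
import pandas as pd


def zero_vote_share(df, vote_columns=("useful", "funny", "cool")):
    """Return the fraction of reviews with exactly zero votes, per column."""
    # eq(0) gives a boolean frame; the mean of booleans is the share of True values.
    return df[list(vote_columns)].eq(0).mean()
```

`zero_vote_share(data)` returns a Series indexed by column name with values between 0 and 1.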

Next, we can explore the distribution of star ratings in the dataset.

In [None]:
import matplotlib.pyplot as plt

# Plot the distribution of star ratings
plt.figure(figsize=(10, 6))
data['stars'].value_counts().sort_index().plot(kind='bar')
plt.xlabel('Star Ratings')
plt.ylabel('Number of Reviews')
plt.title('Distribution of Star Ratings')
plt.show()

The distribution of star ratings shows that most reviews have high ratings (4 or 5 stars). There are fewer reviews with low ratings (1 or 2 stars), and the number of 3-star reviews is also relatively low.
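To put numbers on the bar chart, the counts can be converted to proportions. The sketch below assumes only that `stars` holds the ratings; the helper name `star_shares` is illustrative.

```python
import pandas as pd


def star_shares(stars):
    """Return the share of reviews for each star rating, in ascending rating order."""
    # normalize=True divides each count by the total number of reviews.
    return stars.value_counts(normalize=True).sort_index()
```

For instance, `shares = star_shares(data['stars'])` followed by `shares[shares.index >= 4].sum()` gives the combined share of 4-star and 5-star reviews, which is useful to know before training a classifier on imbalanced classes.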

This completes our first pass over the Yelp Academic Dataset: we have loaded the data, inspected its structure and content, checked for missing values, and explored the distribution of star ratings. The dataset lends itself to tasks such as sentiment analysis and recommendation systems.

## Text Length Analysis

Next, we create a new feature that holds the length of each review in characters, and then check whether review length is related to the number of stars given.

In [None]:
# Create a new feature for the length of each review
data['text_length'] = data['text'].str.len()  # length in characters

# Display the first few rows of the dataframe
data.head()

In [None]:
# Plot the relationship between text length and star ratings
plt.figure(figsize=(10, 6))
for i in range(1, 6):
    plt.hist(data[data['stars'] == i]['text_length'], bins=30, alpha=0.5, label=f'{i} Stars')
plt.xlabel('Text Length')
plt.ylabel('Number of Reviews')
plt.title('Text Length vs. Star Ratings')
plt.legend()
plt.show()

The histograms show that the text-length distribution has a similar shape across all star ratings. There appear to be slightly more long reviews among 1-star and 2-star ratings than among 4-star and 5-star ratings, which could suggest that users write longer reviews after negative experiences. Because the histograms plot raw counts and the star ratings are unevenly represented, this comparison is only indicative; per-rating summary statistics or normalized histograms would give a fairer picture.
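Overlaid histograms make it hard to judge how strong the relationship is. A rough numerical check, sketched below, summarises review length per rating and computes a Spearman rank correlation between length and stars. The helper names are assumptions, and the correlation is computed as the Pearson correlation of the ranks so that only pandas is required.

```python
import pandas as pd


def length_by_stars(df):
    """Summarise review length (in characters) for each star rating."""
    lengths = df["text"].str.len()
    # Group the lengths by the matching star rating and aggregate.
    return lengths.groupby(df["stars"]).agg(["count", "median", "mean"])


def spearman_length_vs_stars(df):
    """Spearman rank correlation between review length and star rating.

    Ranks use the average method for ties, so the Pearson correlation
    of the ranks equals the Spearman coefficient.
    """
    lengths = df["text"].str.len()
    return lengths.rank().corr(df["stars"].rank())
```

A negative value from `spearman_length_vs_stars(data)` would be consistent with longer reviews tending to carry lower ratings; a value near zero would indicate little monotonic relationship.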

## Dataset Relevance for the Project

For the purpose of our project, which is predicting the star rating of a review, not all columns in the dataset are relevant. The most important columns for our task are:

- `text`: This column contains the text of the review. This is the main data that we will use to predict the star rating. We will use natural language processing (NLP) techniques to convert this text data into a format that can be used by a machine learning model.

- `stars`: This is the target variable that we want to predict. It represents the star rating given by the user, from 1 to 5.

The other columns in the dataset could potentially provide useful information for other types of analysis. For example, the `user_id` and `business_id` could be used for building a recommendation system. However, for our specific task of predicting the star rating based on the review text, these columns are not needed.
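As an illustration of what converting review text into a model-ready format can look like, the sketch below builds TF-IDF features with scikit-learn. Choosing TF-IDF, the `max_features` value, and the helper name `vectorize_reviews` are all assumptions for this example; the project may end up using a different representation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize_reviews(texts, max_features=20_000):
    """Fit a TF-IDF vectorizer and return it together with the sparse feature matrix.

    Illustrative only: the vocabulary cap and English stop-word list are
    assumed defaults, not settings taken from the notebook.
    """
    vectorizer = TfidfVectorizer(
        lowercase=True,
        stop_words="english",
        max_features=max_features,
    )
    # One row per review, one column per vocabulary term.
    features = vectorizer.fit_transform(texts)
    return vectorizer, features
```

With the notebook's data this would be called as `vectorizer, X = vectorize_reviews(data['text'])`, with `y = data['stars']` as the target. In practice the vectorizer should be fitted on the training split only and then applied to held-out reviews with `vectorizer.transform`.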