## Introduction
Online reviews strongly influence consumer choices and business reputations. Yelp, a prominent user-generated review platform, hosts millions of reviews. This project cleans the Yelp review dataset and applies natural language processing techniques to uncover patterns and sentiments in the review text. By analyzing this data, we aim to help businesses understand customer feedback, improve their services, and stay competitive.

## Motivation
Yelp is a popular online platform where users can provide reviews and ratings for businesses ranging from restaurants, cafes, and bars to local services and shops. The Yelp review dataset offers a wealth of information from millions of user-generated reviews, making it a valuable resource for conducting data-driven analyses and gaining insights into customer sentiments and behavior.

The primary motivation behind this project is to explore the Yelp review dataset and leverage natural language processing (NLP) techniques to uncover patterns, sentiments, and trends hidden within the vast volume of textual data. By harnessing NLP, we aim to extract valuable information from the reviews, such as sentiment polarity, key phrases, and common topics, that can aid businesses in understanding customer feedback and enhancing their services.

## Dataset Description
The Yelp review dataset is a comprehensive collection of user reviews, encompassing multiple attributes such as user IDs, business IDs, review text, and star ratings. The dataset spans diverse geographical locations, businesses, and user demographics, providing a rich and varied set of reviews.

The main columns in the dataset include:

- `review_id`: A unique identifier for each review.
- `user_id`: The identifier of the user who wrote the review.
- `business_id`: The identifier of the business being reviewed.
- `stars`: The star rating given by the user (ranging from 1 to 5).
- `useful`, `funny`, `cool`: The counts of how many users marked the review as useful, funny, or cool.
- `text`: The textual content of the review.
- `date`: The date and time when the review was posted.

The dataset contains millions of reviews, making it a massive corpus of text that can be mined for valuable insights. However, like any real-world dataset, it also presents challenges, such as missing data, potential outliers, and the need for proper data cleaning and preprocessing.
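
The cleaning concerns above can be sketched as a small streaming filter. This is a minimal sketch, not the project's actual cleaning code: it assumes the reviews are stored as one JSON object per line and that timestamps look like `2018-07-07 22:09:11`, and the helper names `clean_review` and `iter_clean_reviews` are hypothetical.

```python
import json
from datetime import datetime

# Columns listed in the dataset description above.
EXPECTED_FIELDS = ("review_id", "user_id", "business_id", "stars",
                   "useful", "funny", "cool", "text", "date")


def clean_review(record):
    """Return a cleaned copy of one review dict, or None if it is unusable."""
    # Drop records that are missing any expected column.
    if any(record.get(field) is None for field in EXPECTED_FIELDS):
        return None
    text = str(record["text"]).strip()
    if not text:
        return None
    try:
        stars = float(record["stars"])
        # Assumption: timestamps are ISO-like, e.g. "2018-07-07 22:09:11".
        posted = datetime.fromisoformat(str(record["date"]))
    except (TypeError, ValueError):
        return None
    # Star ratings outside the 1-5 range are treated as invalid.
    if not 1 <= stars <= 5:
        return None
    cleaned = {field: record[field] for field in EXPECTED_FIELDS}
    cleaned.update(stars=stars, text=text, date=posted)
    return cleaned


def iter_clean_reviews(lines):
    """Yield cleaned reviews from an iterable of JSON-encoded lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the run
        if not isinstance(record, dict):
            continue
        cleaned = clean_review(record)
        if cleaned is not None:
            yield cleaned
```

Because `iter_clean_reviews` accepts any iterable of lines, an open file handle can be passed directly, so a file with millions of reviews is processed one line at a time rather than loaded into memory at once.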

By analyzing the Yelp review dataset, we aim to unravel patterns of user behavior, perform sentiment analysis to gauge customer sentiments, and contribute to a better understanding of user feedback on various businesses.

## Research Question
How closely does the sentiment polarity predicted by sentiment analysis match the user-provided star ratings of Yelp reviews?

In this study, we aim to employ Natural Language Processing (NLP) techniques, such as the VADER SentimentIntensityAnalyzer, to extract sentiment scores from the textual content of Yelp reviews. The research question focuses on comparing these sentiment scores with the star ratings assigned by users to assess the level of alignment between sentiment and rating.
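
As a rough illustration of the scoring step, the sketch below wraps VADER in a small helper. It assumes the NLTK distribution of VADER (`nltk.sentiment.vader`); the standalone `vaderSentiment` package exposes a similar analyzer. The helper names are hypothetical, and the ±0.05 cut-offs are the thresholds conventionally suggested for VADER's compound score, not values taken from this project.

```python
def make_vader_scorer():
    """Build a function that maps review text to VADER's compound score."""
    # Imported lazily so compound_to_label works without NLTK installed.
    import nltk
    from nltk.sentiment.vader import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
    analyzer = SentimentIntensityAnalyzer()
    # polarity_scores returns neg/neu/pos/compound; compound lies in [-1, 1].
    return lambda text: analyzer.polarity_scores(text)["compound"]


def compound_to_label(compound, threshold=0.05):
    """Map a compound score to a coarse polarity label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"
```

A typical use would be `score = make_vader_scorer()` followed by `compound_to_label(score(review_text))` for each review.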

Through this analysis, we seek to explore whether the extracted sentiments closely align with the star ratings or if there are instances of discrepancies. By validating the sentiment analysis against user ratings, we aim to gain insights into the efficacy of NLP methods for capturing sentiment from reviews and uncover any underlying patterns in the user feedback.
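
One possible way to quantify this alignment is sketched below. The mapping of stars to polarity (1-2 negative, 3 neutral, 4-5 positive) and the ±0.05 compound-score thresholds are assumptions made for illustration, and `compare_with_stars` is a hypothetical helper rather than the project's evaluation code.

```python
from collections import defaultdict
from statistics import mean


def stars_to_label(stars):
    """Assumed mapping: 1-2 stars negative, 3 neutral, 4-5 positive."""
    if stars >= 4:
        return "positive"
    if stars <= 2:
        return "negative"
    return "neutral"


def sentiment_label(compound, threshold=0.05):
    """Assumed thresholds on a compound sentiment score in [-1, 1]."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def compare_with_stars(pairs):
    """Summarize agreement for an iterable of (stars, compound) pairs."""
    by_star = defaultdict(list)
    discrepancies = []
    total = 0
    for index, (stars, compound) in enumerate(pairs):
        total += 1
        by_star[stars].append(compound)
        # Record the position of every review whose labels disagree.
        if stars_to_label(stars) != sentiment_label(compound):
            discrepancies.append(index)
    matches = total - len(discrepancies)
    return {
        "agreement": matches / total if total else 0.0,
        "mean_compound_by_star": {s: mean(v) for s, v in sorted(by_star.items())},
        "discrepancies": discrepancies,
    }
```

The `discrepancies` indices point back to the reviews whose text and rating disagree, which are the cases worth reading by hand, while `mean_compound_by_star` shows whether the average score rises with the rating.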

## Get Data

```python
import os
import sys

# Make the project's local modules in ../src importable from this notebook.
sys.path.append(os.path.abspath('../src'))

from extract_data import process_json_files
```



## EDA

## Data Cleaning