# CM3060 Natural Language Processing
### Coursework Assignment: Text classification

## Introduction

Sentiment analysis, a branch of text classification within Natural Language Processing (NLP), provides both theoretical frameworks and practical techniques for identifying emotions or attitudes in text. It is used in numerous fields, including social media analysis, product review assessment, customer service, and market research. This project explores key areas of NLP by selecting appropriate sentiment analysis methods and applying them to a real-world dataset, addressing a particular text classification challenge: interpreting customer feedback from online reviews.

### Problem area

According to a Local Consumer Review Survey (Yelp, 2012), "The majority of consumers surveyed use online reviews to make spending decisions" and "72% of consumers give the same weight to online reviews as they do to personal recommendations." Luca (2011) states that a one-star increase in Yelp rating leads to a 5-9 percent increase in revenue and that online consumer reviews substitute for more traditional forms of reputation. Together, these findings suggest that online reviews are a valuable source of information for businesses and consumers alike. However, the sheer volume of text generated by online reviews makes it difficult to analyze and interpret customer opinions manually. Sentiment analysis addresses this problem by automatically classifying the sentiment of text, allowing businesses to analyze customer feedback efficiently, track trends, and make informed decisions.


With the rapid development of NLP and machine learning, a number of universal sentiment analysis methods have been proposed. There are free and commercial sentiment analysis tools available, such as:

* [Natural Language Toolkit (NLTK)](https://www.nltk.org/), a Python library that provides a free and open-source sentiment analysis tool based on machine learning approaches in its [nltk.sentiment.sentiment_analyzer](https://www.nltk.org/api/nltk.sentiment.sentiment_analyzer.html) module.

* [TextBlob](https://textblob.readthedocs.io/en/dev/), another Python library for processing textual data. It provides a simple API for NLP tasks, including sentiment analysis.

* [Amazon Comprehend Sentiment Analysis](https://docs.aws.amazon.com/comprehend/latest/dg/how-sentiment.html), a commercial sentiment analysis tool provided by Amazon Web Services (AWS).

However, these tools may not always perform well in a specific area. For example, the sentiment of a review may be influenced by the product category, and the same word may have different meanings in different domains. While these tools can give a baseline level of performance for comparison, creating a model tailored to a specific domain could enhance the accuracy of classifying sentiments.

### Objectives

The main goal of this project is to develop a machine learning model for sentiment analysis of online reviews. The model will be trained on a dataset of customer reviews from Yelp using supervised learning methods. The star ratings accompanying the reviews will serve as labels, guiding the model to categorize the reviews into positive, negative, or neutral sentiment categories based on their textual content.
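As an illustration of how star ratings could be turned into sentiment labels, the sketch below maps a rating to one of the three categories. The thresholds (1-2 negative, 3 neutral, 4-5 positive) and the function name `stars_to_sentiment` are assumptions made for this example; the project does not fix them at this point.

```python
def stars_to_sentiment(stars: int) -> str:
    """Map a 1-5 star rating to a coarse sentiment label.

    Assumed thresholds (illustrative only):
    1-2 stars -> negative, 3 stars -> neutral, 4-5 stars -> positive.
    """
    if not 1 <= stars <= 5:
        raise ValueError(f"star rating must be between 1 and 5, got {stars}")
    if stars <= 2:
        return "negative"
    if stars == 3:
        return "neutral"
    return "positive"


# Label every possible rating once to show the mapping.
labels = {stars: stars_to_sentiment(stars) for stars in range(1, 6)}
print(labels)
```

Other cut-offs are possible, for example treating 3-star reviews as negative, and the choice would affect the class balance.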

The performance of this model will be compared against the existing general-purpose sentiment analysis tools mentioned above. Through this comparison, the project aims to determine whether a domain-specialized model can improve the accuracy of sentiment detection in a specific context.

Additionally, the project will explore how star ratings in online reviews correlate with expressed sentiments. This investigation will contribute to a deeper understanding of sentiment analysis within NLP, providing meaningful insights for both the business sector and academic researchers.

### Dataset

The [Yelp Open Dataset](https://www.yelp.com/dataset) is a dataset provided by Yelp Inc. that contains a curated sample of Yelp's businesses, reviews, and user data. As stated in its [terms of use](https://s3-media0.fl.yelpcdn.com/assets/srv0/engineering_pages/f64cb2d3efcc/assets/vendor/Dataset_User_Agreement.pdf), "The Data is made available by Yelp Inc. (“Yelp”) to enable you to access valuable local information to develop an academic project as part of an ongoing course of study." Therefore, the dataset is suitable and permitted for use in this project.

The dataset is available in JSON format and can be downloaded from [here](https://www.yelp.com/dataset/) upon request. Yelp provides documentation for the dataset [here](https://www.yelp.com/dataset/documentation/main). The dataset contains 6,685,900 reviews, 192,609 businesses, and 200,000 pictures covering 10 metropolitan areas. The dataset is regularly updated, and the current version (as of 19 December 2020) contains 8,021,122 reviews, 209,393 businesses, and 1,334,097 pictures covering 11 metropolitan areas. For the purpose of this project, the file `yelp_academic_dataset_review.json` will be used, which contains the following fields:

| Field      | Example | Description |
| ----------- | ------ | ----------- |
| review_id | zdSx_SD6obEhz9VrW9uAWA | string, 22 character unique review id |
| user_id | Ha3iJu77CxlrFm-vQRs_8g | string, 22 character unique user id, maps to the user in user.json |
| business_id | tnhfDv5Il8EaGSXZGiuQGg | string, 22 character business id, maps to business in business.json |
| stars | 4 | integer, star rating |
| date | 2016-03-09 | string, date formatted YYYY-MM-DD |
| text | Great place to hang out after work | string, the review itself |
| useful | 0 | integer, number of useful votes received |
| funny | 0 | integer, number of funny votes received |
| cool | 0 | integer, number of cool votes received |

The dataset has several potential issues. First, it is regularly updated, which can hinder the reproducibility of results unless the version used is recorded. Second, it includes only a curated list of reviews called [Recommended Reviews](https://www.yelp-support.com/Recommended_Reviews), which may introduce bias, as the language used in these reviews may differ from the language used in other reviews. Finally, the dataset is not balanced in terms of the number of reviews per star rating, but this can be addressed by using stratified sampling.
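A minimal sketch of stratified sampling, using only the standard library and made-up toy data, is shown below. The helper `stratified_sample` and its parameters are hypothetical names introduced for illustration, not part of the project code.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction of rows from every group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for value in sorted(groups):
        members = groups[value]
        # Keep at least one row per group so no rating disappears.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy, made-up data: 80 five-star reviews and 20 one-star reviews.
toy_reviews = [{"stars": 5, "text": f"good {i}"} for i in range(80)]
toy_reviews += [{"stars": 1, "text": f"bad {i}"} for i in range(20)]

subset = stratified_sample(toy_reviews, key="stars", fraction=0.1)
counts = {s: sum(r["stars"] == s for r in subset) for s in (1, 5)}
print(counts)  # {1: 2, 5: 8}
```

Note that stratification preserves the original share of each rating in the sample. Drawing a fixed number of reviews per rating instead would yield equally sized classes.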

### Evaluation methodology

Several metrics will be used to evaluate the model's performance. Common metrics for classification problems include _Accuracy_, _Precision_, _Recall_, and _F1 Score_. _Accuracy_ is the simplest measure, calculated as the ratio of correctly predicted observations to the total number of observations. However, it is not always a reliable metric, especially when the dataset is imbalanced, as is the case with the Yelp dataset. _Precision_ and _Recall_ are more suitable for imbalanced datasets.
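To see why accuracy can mislead on imbalanced data, consider the toy example below. The labels are made up: 90 positive and 10 negative reviews, scored against a trivial classifier that always predicts the majority class.

```python
# Made-up labels: 90 positive reviews and 10 negative ones.
y_true = ["positive"] * 90 + ["negative"] * 10
# A trivial classifier that always predicts the majority class.
y_pred = ["positive"] * 100

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for the "negative" class: share of actual negatives that were found.
negatives_found = sum(t == p == "negative" for t, p in zip(y_true, y_pred))
negative_recall = negatives_found / y_true.count("negative")

print(accuracy)         # 0.9
print(negative_recall)  # 0.0
```

The classifier reaches 90% accuracy while failing to identify a single negative review.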

_Precision_ is defined as:

$\text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}}$

It measures how many of the positively predicted instances are actually positive.

_Recall_ is defined as:

$\text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}}$

This metric shows the proportion of actual positives correctly identified by the model.

Precision and Recall are equally important for the purpose of this project. However, the two are often in tension: improving one tends to reduce the other. It is therefore useful to combine them into a single metric that weights both equally. This metric is called the _F1 Score_ and is defined as:

$\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$

The _F1 Score_ is the harmonic mean of Precision and Recall and therefore balances the two. The harmonic mean is preferred over a simple average because it penalizes extreme values: the score is high only when both Precision and Recall are high. The F1 Score is widely regarded as a comprehensive measure for evaluating classification models, and it will serve as the primary metric for comparing the sentiment analysis methods considered in this project.
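The definitions above can be sketched in plain Python as follows. Because the project has three sentiment classes, the sketch computes the metrics per class and then averages the per-class F1 scores (macro averaging). The averaging scheme, the function names, and the example labels are assumptions for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive_label):
    """Compute precision, recall and F1 for one class treated as 'positive'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive_label and p == positive_label for t, p in pairs)
    fp = sum(t != positive_label and p == positive_label for t, p in pairs)
    fn = sum(t == positive_label and p != positive_label for t, p in pairs)
    # Guard against division by zero when a class is never predicted or never occurs.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = [precision_recall_f1(y_true, y_pred, c)[2] for c in classes]
    return sum(scores) / len(scores)


# Made-up example labels.
y_true = ["pos", "pos", "pos", "neg", "neg", "neu"]
y_pred = ["pos", "pos", "neg", "neg", "pos", "neu"]

for label in ("pos", "neg", "neu"):
    print(label, precision_recall_f1(y_true, y_pred, label))
print("macro F1:", macro_f1(y_true, y_pred))
```

As an example of how the harmonic mean penalizes extreme values, a Precision of 0.9 combined with a Recall of 0.1 has an arithmetic mean of 0.5 but an F1 Score of only 0.18.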

## Implementation

### Requirements

Python 3.12 or higher is required. The required packages are listed in `requirements.txt` and can be installed with:

```bash
pip install -r requirements.txt
```

```python
import pandas as pd
```

### Data preparation

The dataset should be downloaded from [Yelp](https://www.yelp.com/dataset/), unzipped, and placed in the `yelp_dataset` folder.

Please note: The dataset is quite large and loading takes time. It took 8 minutes on Apple M1 16 GB RAM.

```python
df = pd.read_json("yelp_dataset/yelp_academic_dataset_review.json", orient="records", lines=True)
```
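If loading the whole file at once is too slow or uses too much memory, one alternative is to stream it line by line and keep only the fields that are needed. The sketch below uses the standard library only. The helper `iter_reviews` is a hypothetical name, and the two records are made up to stand in for the real file.

```python
import io
import json


def iter_reviews(lines, fields=("stars", "text")):
    """Yield one dict per JSON line, keeping only the requested fields."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        yield {name: record[name] for name in fields}


# Stand-in for the real file: two made-up records in JSON Lines form.
sample_file = io.StringIO(
    '{"review_id": "a", "stars": 4, "text": "Great place", "useful": 0}\n'
    '{"review_id": "b", "stars": 1, "text": "Never again", "useful": 2}\n'
)
reviews = list(iter_reviews(sample_file))
print(reviews)
```

With the real dataset, the same function could be given a file object opened on `yelp_dataset/yelp_academic_dataset_review.json`. pandas also accepts a `chunksize` argument to `read_json` when `lines=True`, which returns an iterator over smaller DataFrames.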

```python
df
```

|   | review_id | user_id | business_id | stars | useful | funny | cool | text | date |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | KU_O5udG6zpxOg-VcAEodg | mh_-eMZ6K5RLWhZyISBhwA | XQfwVwDr-v0ZS3_CbbE5Xw | 3 | 0 | 0 | 0 | If you decide to eat here, just be aware it is... | 2018-07-07 22:09:11 |
| 1 | BiTunyQ73aT9WBnpR9DZGw | OyoGAe7OKpv6SyGZT5g77Q | 7ATYjTIgM3jUlt4UM3IypQ | 5 | 1 | 0 | 1 | I've taken a lot of spin classes over the year... | 2012-01-03 15:28:18 |
| 2 | saUsX_uimxRlCVr67Z4Jig | 8g_iMtfSiwikVnbP2etR0A | YjUWPpI6HXG530lwP-fb2A | 3 | 0 | 0 | 0 | Family diner. Had the buffet. Eclectic assortm... | 2014-02-05 20:30:30 |
| 3 | AqPFMleE6RsU23_auESxiA | _7bHUi9Uuf5__HHc_Q8guQ | kxX2SOes4o-D3ZQBkiMRfA | 5 | 1 | 0 | 1 | Wow! Yummy, different, delicious. Our favo... | 2015-01-04 00:01:03 |
| 4 | Sx8TMOWLNuJBWer-0pcmoA | bcjbaE6dDog4jkNY91ncLQ | e4Vwtrqf-wpJfwesgvdgxQ | 4 | 1 | 0 | 1 | Cute interior and owner (?) gave us tour of up... | 2017-01-14 20:54:15 |


## References

Yelp Inc. (2012) _Survey: 85% of Consumers Use the Internet to Find Local Businesses_ [Online] Available from: https://blog.yelp.com/businesses/survey-85-of-consumers-use-the-internet-to-find-local-businesses/ [Accessed 19 December 2023].

Luca, M. (2011) _Reviews, Reputation, and Revenue: The Case of Yelp.Com_ [Online] Available from: https://www.hbs.edu/ris/Publication%20Files/12-016_a7e4a5a2-03f9-490d-b093-8f951238dba2.pdf [Accessed 19 December 2023].

Yelp Inc. (2023) _Yelp Open Dataset_ [Online] Available from: https://www.yelp.com/dataset [Accessed 19 December 2023].