# Sentiment Analysis on Fuel Stations
We have successfully used the Google Places API to retrieve the ratings and reviews of all fuel stations in Malaysia from Google Maps. These ratings and reviews are important for determining the value of each station and how it performs according to market sentiment. Taking some references from ChatGPT, below is an overview of sentiment analysis:

### Sentiment Analysis: A Brief Overview

**Sentiment analysis**, also known as opinion mining, is the process of identifying and categorizing opinions expressed in a piece of text. It is widely used to determine whether a piece of text conveys a positive, negative, or neutral sentiment. This technique is essential for businesses, especially in analyzing customer feedback, product reviews, and social media data, to gauge public opinion.

#### How Sentiment Analysis Works:
1. **Text Processing:**
   The first step in sentiment analysis involves cleaning the text data by removing irrelevant information (e.g., special characters, stop words) and then tokenizing it into individual words or phrases. The processed text is then ready for analysis.

2. **Sentiment Classification:**
   Sentiment analysis typically involves classifying text into predefined sentiment categories:
   - **Positive** sentiment: Expresses favorable opinions or emotions.
   - **Negative** sentiment: Shows discontent or unfavorable opinions.
   - **Neutral** sentiment: Reflects a neutral stance without strong emotion.

   Sentiment classification can be achieved using various methods:
   - **Lexicon-based approach:** Relies on pre-defined dictionaries (lexicons) of words labeled with their sentiment values. For example, words like "good" or "excellent" are associated with positive sentiments, while words like "bad" or "poor" indicate negative sentiments.
   - **Machine learning-based approach:** Uses labeled datasets to train models such as Naive Bayes, Support Vector Machines, or neural networks, which learn to classify sentiment from text automatically.
   - **Pre-trained language models:** Models like BERT or GPT-3 can analyze text by capturing the context and nuances, which are often missed by traditional methods.

3. **Applications of Sentiment Analysis:**
   - **Customer feedback analysis:** Businesses use sentiment analysis to assess customer opinions about their products or services.
   - **Market research:** It helps understand market trends and consumer preferences.
   - **Social media monitoring:** Companies analyze social media posts to track brand reputation and public perception.
   - **Political analysis:** Sentiment analysis can be used to measure public sentiment about political issues or candidates.
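
To make the text processing and lexicon-based steps above more concrete, here is a minimal sketch in plain Python. The `LEXICON` and `STOP_WORDS` contents, and the function names `tokenize` and `classify`, are hypothetical and chosen only for illustration; a real lexicon would be far larger and weighted more carefully.

```python
import re

# Hypothetical toy lexicon and stop-word list, for illustration only
LEXICON = {"good": 1, "excellent": 2, "helpful": 1, "clean": 1,
           "bad": -1, "poor": -1, "dirty": -1, "worst": -2}
STOP_WORDS = {"the", "is", "a", "an", "and", "are", "was", "very"}

def tokenize(text):
    """Lowercase, drop special characters, split into words, remove stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def classify(text):
    """Sum the lexicon scores of the tokens and map the total to a label."""
    score = sum(LEXICON.get(tok, 0) for tok in tokenize(text))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify("The staff are very helpful and the toilet is clean"))
print(classify("Worst station, the toilet is dirty"))
print(classify("Typical petrol station"))
```

Words that are missing from the lexicon contribute nothing to the score, which is why a review with no known sentiment words falls back to neutral.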

#### Sentiment Analysis Tools:
Some popular tools for sentiment analysis include:
- **TextBlob:** A simple Python library that calculates sentiment polarity and subjectivity.
- **VADER (Valence Aware Dictionary and Sentiment Reasoner):** Designed for sentiment analysis of social media texts, handling emojis, slang, and capitalization effectively.
- **NLTK (Natural Language Toolkit):** A more comprehensive library for various NLP tasks, including sentiment analysis.

#### Challenges in Sentiment Analysis:
- **Sarcasm and irony:** These can be difficult to detect, as they may use positive words to convey a negative meaning.
- **Context:** Words that carry different sentiments in different contexts require advanced models to interpret them accurately.
- **Ambiguity:** Words or phrases with multiple meanings can pose a challenge, particularly in short texts like tweets.
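
The sarcasm and context problems listed above are easy to reproduce with a naive word-counting scorer. The sketch below uses a hypothetical four-word lexicon and a hypothetical `naive_score` helper; both example sentences are complaints, yet each receives a positive score.

```python
# Hypothetical four-word lexicon; real lexicons are far larger
LEXICON = {"good": 1, "great": 1, "bad": -1, "dirty": -1}

def naive_score(text):
    """Add up word scores with no awareness of context."""
    return sum(LEXICON.get(w.strip(".,!?").lower(), 0) for w in text.split())

# Negation: "not good" is a complaint, but only "good" is scored
print(naive_score("The service is not good"))

# Sarcasm: positive wording used to describe a negative experience
print(naive_score("Great, another pump out of order"))
```

This is one reason context-aware models are often preferred over pure word counting for review text.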

### Sources:
- Cambria, E., Schuller, B., Xia, Y., & Havasi, C. (2013). New Avenues in Opinion Mining and Sentiment Analysis. *IEEE Intelligent Systems*, 28(2), 15-21.
- Zhang, L., Wang, S., & Liu, B. (2018). Deep learning for sentiment analysis: A survey. *Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery*, 8(4).
- Hutto, C., & Gilbert, E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. *Eighth International Conference on Weblogs and Social Media (ICWSM-14)*.


In [1]:
#let's load the data first and perform some preprocessing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('fuel_station_reviews.csv')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9898 entries, 0 to 9897
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    9898 non-null   int64 
 1   station_name  9898 non-null   object
 2   place_name    9898 non-null   object
 3   location      9898 non-null   object
 4   author_name   9898 non-null   object
 5   rating        9898 non-null   int64 
 6   text          9631 non-null   object
 7   time          9898 non-null   int64 
dtypes: int64(3), object(5)
memory usage: 618.8+ KB


The main challenge in this sentiment analysis is the language itself. Because the reviews are written by people in Malaysia, many are likely to be in informal English, and some may use or mix other languages. Let's look at the review data first.

In [2]:
# Fill missing values (NaN) with an empty string
df['text'] = df['text'].fillna('')

# Convert all values in the 'text' column to strings
df['text'] = df['text'].astype(str)

# Combine all the reviews in the 'text' column into one string
combined_reviews = " ".join(df['text'].tolist())

# Display only the first 1000 characters of the combined string
print(combined_reviews[:1000])


Another station charging higher than normal diesel fuel with B10, not giving consumers access to the cheaper diesel blend of Euro5. This practice should be prohibited! Got petrol here most of the staff are very helpful, kind and always smiling Staff always with smiling face Its okay, typical petrol station. One of two choices you can opt when in parit buntar town
Next to this petrol pump got another petrol pumo that is petronas
There is pizza restaurant attacthed to the main building of this petrol pump
The offer sometimes are very good to avail

The only drawback is you have to find a good spot if u wish to make a uturn and it really difficult for big vehicle One of the best Petrol Station in Malaysia, the service was unbelievable! All the staff was very helpful and kind, they treat us like their friend. Second is the toilet. It was so cozy, beautiful and the most important is clean. The third one is the delicious "on the go" food. There's a lovely Pizza and Bun served there. The also

It seems that most of the reviews we extracted are in English, although often informal or ungrammatical. Let's analyze this more systematically by using the langdetect library to detect the language of each review.

In [3]:
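from langdetect import DetectorFactory, LangDetectException, detect

# langdetect is non-deterministic by default; fix the seed so results are reproducible
DetectorFactory.seed = 0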

# Function to detect language with exception handling
def detect_language(text):
    try:
        # Check if the text is valid and long enough for detection
        if isinstance(text, str) and len(text.strip()) > 3:  # You can adjust the length threshold
            return detect(text)
        else:
            return 'unknown'
    except LangDetectException:
        return 'unknown'

# Apply the language detection function to each row
df['Language'] = df['text'].apply(detect_language)

# Display the result
df


Unnamed: 0.1,Unnamed: 0,station_name,place_name,location,author_name,rating,text,time,Language
0,0,TAMAN ANDA,Petronas Taman Anda,"4.629239,101.114431",Stephen ong,3,Another station charging higher than normal di...,1676626422,en
1,1,TAMAN ANDA,Petronas Taman Anda,"4.629239,101.114431",Che Syaiful,4,Got petrol here,1726010026,en
2,2,TAMAN ANDA,Petronas Taman Anda,"4.629239,101.114431",Mohammad Fithri Mohammad Sharifuddin,4,"most of the staff are very helpful, kind and a...",1601110289,en
3,3,TAMAN ANDA,Petronas Taman Anda,"4.629239,101.114431",Stephen lee,5,Staff always with smiling face,1600260670,en
4,4,TAMAN ANDA,Petronas Taman Anda,"4.629239,101.114431",Iqbal Hakim (Kim),5,"Its okay, typical petrol station.",1614015946,en
...,...,...,...,...,...,...,...,...,...
9893,9893,BATU 5 GOMBAK,Petronas Batu 5 Jalan Gombak ( Mesra Ikhwan ),"3.212697,101.708571",raja aziera syahfiqah,1,Toilet is very bad like no maintenance taken a...,1709637382,en
9894,9894,BATU 5 GOMBAK,Petronas Batu 5 Jalan Gombak ( Mesra Ikhwan ),"3.212697,101.708571",Mohd Syamsul Maksud,1,"Worst petro station, 50 years Pet yesterday, b...",1723973100,en
9895,9895,BATU 5 GOMBAK,Petronas Batu 5 Jalan Gombak ( Mesra Ikhwan ),"3.212697,101.708571",Liza Idayu Zakaria,1,The toilet is soooo dirty. Flush is out of ord...,1697432855,en
9896,9896,BATU 5 GOMBAK,Petronas Batu 5 Jalan Gombak ( Mesra Ikhwan ),"3.212697,101.708571",Izmine Azmine,4,Spaces n convinience atmosphere to drop by ...👌😊,1719944438,en


In [4]:
# Let's see which language labels were assigned, including any 'unknown' rows
df['Language'].unique()

array(['en', 'unknown', 'ca', 'et', 'so', 'it', 'ro', 'da', 'fr', 'nl',
       'sv', 'cy', 'es', 'af', 'hr', 'tl', 'id', 'no', 'de', 'cs', 'sk',
       'fi', 'vi', 'sl', 'ko', 'pt', 'sw', 'lt', 'hu', 'pl', 'tr',
       'zh-cn'], dtype=object)
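
The list of unique labels shows which languages appear, but not how common each one is. A frequency count answers that. The sketch below uses a short hypothetical list in place of `df['Language'].tolist()`; on the real DataFrame, `df['Language'].value_counts()` would give the equivalent tally.

```python
from collections import Counter

# Hypothetical labels standing in for df['Language'].tolist()
labels = ["en", "en", "id", "en", "unknown", "so", "en", "id"]

counts = Counter(labels)
total = len(labels)

# Print each label with its count and share of all reviews, most common first
for lang, n in counts.most_common():
    print(f"{lang:8s} {n:3d}  {n / total:.1%}")
```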

In [5]:

df.loc[df['Language']=='zh-cn']

Unnamed: 0.1,Unnamed: 0,station_name,place_name,location,author_name,rating,text,time,Language
7229,7229,JURU LAYBY ARAH UTARA,Shell,"5.35459,100.41615",shu theng Phuah,1,服务员态度不好,1723247079,zh-cn


It seems that our reviews span multiple languages. Running langdetect first is a useful way to spot reviews that are not in English before we proceed with sentiment analysis, although its labels for very short reviews may be unreliable. Since several languages are present, let's use a more comprehensive, multilingual approach for our sentiment analysis.
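
The model used in the next cell, `nlptown/bert-base-multilingual-uncased-sentiment`, outputs one raw score (logit) per star rating from 1 to 5. As a rough sketch of how such scores become a label and a confidence value, the plain-Python example below applies a softmax and picks the most probable class. The logits are made up, and the function names and label wording are assumptions for illustration rather than part of the model.

```python
import math

# Label order assumed to follow the model's 1-to-5-star output classes
LABELS = ["very negative", "negative", "neutral", "positive", "very positive"]

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def label_and_confidence(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]

# Hypothetical logits for a single review
label, confidence = label_and_confidence([-1.2, -0.3, 0.4, 1.1, 2.0])
print(label, round(confidence, 3))
```

Because the confidence is only the probability of the winning class, a low value suggests that the model's probability mass is spread across several star ratings.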

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from torch.utils.data import DataLoader, Dataset

# Check if GPU is available and set the device accordingly
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Initialize the sentiment analysis model and tokenizer
model_name = "nlptown/bert-base-multilingual-uncased-sentiment"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model = model.to(device)

# Dataset class to handle the text data
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length=512):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        # Tokenize the text with truncation and padding
        inputs = self.tokenizer(
            text, 
            max_length=self.max_length, 
            truncation=True, 
            padding="max_length", 
            return_tensors="pt"
        )
        return {key: val.squeeze(0) for key, val in inputs.items()}  # Remove batch dimension

# Function to calculate sentiment for a batch of inputs
def get_batch_sentiment(texts, batch_size=16):
    dataset = TextDataset(texts, tokenizer)
    dataloader = DataLoader(dataset, batch_size=batch_size)

    sentiment_labels = []
    sentiment_scores = []

    model.eval()  # Set the model to evaluation mode

    # The model predicts 1 to 5 stars (class index 0 to 4); map each class to a sentiment label
    label_map = {0: 'very negative', 1: 'negative', 2: 'neutral', 3: 'positive', 4: 'very positive'}

    with torch.no_grad():  # Disable gradient calculation for efficiency
        for batch in dataloader:
            # Move the batch to the device (GPU or CPU)
            batch = {key: val.to(device) for key, val in batch.items()}

            # Get model outputs
            outputs = model(**batch)
            predictions = torch.softmax(outputs.logits, dim=1)

            # Get the highest confidence score and corresponding label
            confidences, predicted_classes = torch.max(predictions, dim=1)

            # Convert predictions to sentiment labels and scores
            for confidence, predicted_class in zip(confidences, predicted_classes):
                sentiment_labels.append(label_map[predicted_class.item()])
                sentiment_scores.append(confidence.item())

    return sentiment_labels, sentiment_scores

# Apply the batch sentiment analysis to the dataset
df['text'] = df['text'].fillna('')  # Fill missing values
batch_size = 16  # Adjust batch size depending on your available memory

# Get the sentiment labels and scores in batches
df['Sentiment'], df['Sentiment_Score'] = get_batch_sentiment(df['text'].tolist(), batch_size=batch_size)

# Display the result
print(df[['text', 'Sentiment', 'Sentiment_Score']])

# Optional: Save the output to a CSV file
df.to_csv('sentiment_results.csv', index=False)



                                                   text      Sentiment  \
0     Another station charging higher than normal di...  very negative   
1                                       Got petrol here  very positive   
2     most of the staff are very helpful, kind and a...  very positive   
3                        Staff always with smiling face  very positive   
4                     Its okay, typical petrol station.        neutral   
...                                                 ...            ...   
9893  Toilet is very bad like no maintenance taken a...  very negative   
9894  Worst petro station, 50 years Pet yesterday, b...  very negative   
9895  The toilet is soooo dirty. Flush is out of ord...  very negative   
9896   Spaces n convinience atmosphere to drop by ...👌😊       negative   
9897  Love all the personnel working here. They are ...  very positive   

      Sentiment_Score  
0            0.647881  
1            0.429729  
2            0.549436  
3            0.