# British Airways customer reviews analysis: sentiment analysis
## Dr José M Albornoz
### February 2024

In this notebook we perform sentiment analysis on the BA customer reviews data using a [state-of-the-art DistilBERT model](https://huggingface.co/sohan-ai/sentiment-analysis-model-amazon-reviews) that has been pre-trained on Amazon customer reviews.

# 0.- Imports

In [1]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification
import numpy as np

# 1.- Read reviews data

In [2]:
with open('data/reviews.txt', 'r') as f:
    reviews = f.readlines()

In [3]:
numreviews = len(reviews)
numreviews

3752

# 2.- Data cleansing

We remove the closing quote and newline character at the end of each review, as well as the `b'` prefix at the beginning of the review.

In [4]:
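for k in range(numreviews):
    # Strip only the trailing quote + newline and the leading b' prefix (Python 3.9+),
    # so that a b' sequence inside the text (e.g. "club's") is left untouched
    reviews[k] = reviews[k].removesuffix("'\n").removeprefix("b'")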
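The cleansing step can also be expressed as a small helper that only looks at the start and end of each line. The following is a minimal sketch, not the notebook's own method: `clean_review` is a hypothetical name, and the handling of a `b"` wrapper (double quotes) is an added assumption about how reviews containing apostrophes may have been written to the file.

```python
def clean_review(line: str) -> str:
    """Strip a b'...' or b"..." wrapper and the trailing newline from one raw line."""
    text = line.rstrip("\n")
    for quote in ("'", '"'):
        prefix = "b" + quote
        # Only unwrap when both the prefix and the matching closing quote are present.
        if len(text) >= 3 and text.startswith(prefix) and text.endswith(quote):
            return text[len(prefix):-1]
    # Lines without a recognised wrapper are returned unchanged (minus the newline).
    return text


# Hypothetical sample lines, made up for illustration.
raw_lines = ["b'Great crew.'\n", 'b"The club\'s lounge was quiet."\n']
cleaned = [clean_review(line) for line in raw_lines]
print(cleaned)
```

Because the helper never touches the middle of the string, apostrophes and `b'` sequences inside a review survive the cleaning.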

# 3.- Sentiment analysis

In [5]:
model_name = "sohan-ai/sentiment-analysis-model-amazon-reviews"
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")
model = DistilBertForSequenceClassification.from_pretrained(model_name)

In [6]:
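sentiment = []
for k in range(numreviews):
    # Tokenise one review; truncation keeps it within the model's maximum input length
    inputs = tokenizer(reviews[k], return_tensors="pt", truncation=True, padding=True)

    # Forward pass: the logits hold one score per class
    outputs = model(**inputs)

    # Class index 1 is treated as "positive"; anything else as "negative"
    predicted_label = "positive" if outputs.logits.argmax().item() == 1 else "negative"

    sentiment.append(predicted_label)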
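To make the labelling rule in the loop above more concrete, the sketch below shows how a pair of class scores can be turned into a label and a softmax probability using only the standard library. `label_from_logits` is a hypothetical helper and the logits are made-up values; only the convention that index 1 means "positive" is taken from the notebook.

```python
import math


def label_from_logits(logits: list[float]) -> tuple[str, float]:
    """Return the predicted label and its softmax probability for one review."""
    # Subtract the largest logit before exponentiating, for numerical stability.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Index 1 is treated as "positive", matching the loop above.
    best = probs.index(max(probs))
    label = "positive" if best == 1 else "negative"
    return label, probs[best]


# Made-up logits for illustration: [score for class 0, score for class 1].
label, confidence = label_from_logits([-1.2, 2.3])
print(label, round(confidence, 3))
```

Keeping the probability alongside the label would make it possible to flag borderline reviews, which the hard `argmax` decision alone does not expose.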

We can now calculate the percentages of negative and positive reviews for BA.

In [7]:
# a holds the sorted unique labels, b the number of reviews with each label
a, b = np.unique(sentiment, return_counts=True)

In [8]:
a

array(['negative', 'positive'], dtype='<U8')

In [9]:
b

array([2449, 1303], dtype=int64)

In [10]:
tot_reviews = b[0] + b[1]

In [11]:
tot_reviews

3752

In [13]:
positive_percentage = b[1]*100/tot_reviews
positive_percentage

34.72814498933902

In [14]:
negative_percentage = b[0]*100/tot_reviews
negative_percentage

65.27185501066099
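The percentage calculation can also be written without relying on the position of each label in an array. This is a minimal sketch using `collections.Counter`; `sentiment_percentages` is a hypothetical helper, and the label list is rebuilt from the counts reported above (2449 negative, 1303 positive) rather than from the model output.

```python
from collections import Counter


def sentiment_percentages(labels: list[str]) -> dict[str, float]:
    """Return the percentage of reviews carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Note: an empty list would raise ZeroDivisionError here.
    return {label: 100 * count / total for label, count in counts.items()}


# Rebuild a label list with the counts reported above.
labels = ["negative"] * 2449 + ["positive"] * 1303
percentages = sentiment_percentages(labels)
print({label: round(value, 2) for label, value in percentages.items()})
```

Looking values up by label name avoids depending on the alphabetical ordering that `np.unique` returns.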