## Problem Statement
Create a simple Python script that uses a pre-trained language model (e.g., GPT-3, GPT-4) to perform sentiment analysis on a set of product reviews. The script should read the reviews from a text file, analyze the sentiment of each review (positive, negative, neutral), and output the results to another text file.

### Working with the data

In [1]:
import pandas as pd

In [6]:
df = pd.read_table("data/amazon_alexa.tsv")
df.head()

Unnamed: 0,rating,date,variation,verified_reviews,feedback
0,5,31-Jul-18,Charcoal Fabric,Love my Echo!,1
1,5,31-Jul-18,Charcoal Fabric,Loved it!,1
2,4,31-Jul-18,Walnut Finish,"Sometimes while playing a game, you can answer...",1
3,5,31-Jul-18,Charcoal Fabric,I have had a lot of fun with this thing. My 4 ...,1
4,5,31-Jul-18,Charcoal Fabric,Music,1


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   rating            3150 non-null   int64 
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3149 non-null   object
 4   feedback          3150 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 123.2+ KB


In [8]:
df.isnull().sum()

rating              0
date                0
variation           0
verified_reviews    1
feedback            0
dtype: int64

In [9]:
df['rating'].value_counts()

rating
5    2286
4     455
1     161
3     152
2      96
Name: count, dtype: int64

In [11]:
df['feedback'].value_counts()

feedback
1    2893
0     257
Name: count, dtype: int64

In [12]:
df.shape

(3150, 5)

In [15]:
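def read_reviews(input_file):
    """Read a .tsv review file and return the rating and review text columns."""
    try:
        df = pd.read_table(input_file)  # read_table() for .tsv files, read_csv() for .csv files
        # keep only the needed columns, then drop rows with a missing review
        df = df[['rating', 'verified_reviews']].dropna()
        return df

    except Exception as e:
        print(f"Error reading input file: {e}")
        # return an empty DataFrame so callers always get the same type
        return pd.DataFrame(columns=['rating', 'verified_reviews'])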

In [18]:
test_df = read_reviews("data/amazon_alexa.tsv")

In [19]:
test_df.head()

Unnamed: 0,rating,verified_reviews
0,5,Love my Echo!
1,5,Loved it!
2,4,"Sometimes while playing a game, you can answer..."
3,5,I have had a lot of fun with this thing. My 4 ...
4,5,Music


### Initialising Hugging Face Transformers for Sentiment Analysis

In [3]:
!pip install transformers



In [5]:
from transformers import pipeline

In [6]:
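# Testing the installation
# (roberta-large-mnli is an NLI model, so its labels are CONTRADICTION / NEUTRAL / ENTAILMENT, not sentiment classes)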
pipe = pipeline(model="FacebookAI/roberta-large-mnli")
pipe("This restaurant is awesome")

[{'label': 'NEUTRAL', 'score': 0.7313135862350464}]

### Initialising the pipeline with _'zero-shot-classification'_ and a suitable model for it

In [18]:
pipe = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

In [26]:
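# writing the prompt

def analyze_sentiment(rating, review):

    labels = ["Positive", "Negative", "Neutral"]
    # f-string so the actual rating and review text are sent to the model
    response = pipe(f"Rating: {rating}\n\nReview: {review}",
                    candidate_labels=labels)

    # the pipeline returns its labels sorted by score, so return that order
    # to keep labels and scores aligned
    return response['labels'], response['scores']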
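
The zero-shot pipeline returns a dictionary whose `labels` and `scores` lists are aligned with each other and sorted from highest to lowest score, which is not necessarily the order of `candidate_labels`. Below is a minimal sketch of picking the top label from such a result. `top_label` is a hypothetical helper, and `fake_result` is made-up data shaped like the pipeline output, not real model output.

```python
def top_label(result):
    """Return the (label, score) pair with the highest score."""
    pairs = zip(result["labels"], result["scores"])
    return max(pairs, key=lambda pair: pair[1])


# Made-up example shaped like a zero-shot-classification result.
fake_result = {
    "sequence": "Rating: 2\n\nReview: Stopped working after a week.",
    "labels": ["Negative", "Neutral", "Positive"],
    "scores": [0.90, 0.07, 0.03],
}

label, score = top_label(fake_result)
print(label, score)
```

Pairing each label with its score before taking the maximum avoids relying on any particular ordering of the two lists.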

In [30]:
sentiment = list() # to store the sentiment of each review

for row in test_df.head().itertuples():
    rating = row.rating
    review = row.verified_reviews

    # getting the sentiment scores
    labels, scores = analyze_sentiment(rating, review)
    
    # getting the sentiment with max score
    maxi = max(range(len(scores)), key=lambda i: scores[i])

    senti = str(labels[maxi])

    # appending the sentiment to the list
    sentiment.append(senti)
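
The loop above handles the first five rows. The same idea can be sketched for any number of reviews, with the classifier passed in as an argument so that it can be swapped or stubbed. `label_reviews` and `stub_classifier` are hypothetical names. The stub is a crude rating rule standing in for the model purely so the example runs without downloading anything, and the second and third sample reviews are made up.

```python
def label_reviews(rows, classify):
    """Label each (rating, review) pair with its top-scoring sentiment.

    `classify` takes (rating, review) and returns (labels, scores),
    two lists of equal length aligned by position.
    """
    sentiments = []
    for rating, review in rows:
        labels, scores = classify(rating, review)
        best = max(range(len(scores)), key=scores.__getitem__)
        sentiments.append(labels[best])
    return sentiments


def stub_classifier(rating, review):
    """Stand-in for the model: a crude rating rule, for illustration only."""
    labels = ["Positive", "Negative", "Neutral"]
    if rating >= 4:
        scores = [0.8, 0.1, 0.1]
    elif rating <= 2:
        scores = [0.1, 0.8, 0.1]
    else:
        scores = [0.2, 0.2, 0.6]
    return labels, scores


rows = [(5, "Love my Echo!"), (1, "Stopped working."), (3, "It is okay.")]
print(label_reviews(rows, stub_classifier))
```

In the notebook, something like `label_reviews(test_df.itertuples(index=False, name=None), analyze_sentiment)` should work, assuming `analyze_sentiment` returns labels and scores aligned by position.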

In [31]:
sentiment

['Positive', 'Positive', 'Positive', 'Positive', 'Positive']
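
One way to sanity-check predictions like these is to compare them with a label derived from the star rating. The sketch below assumes a mapping of 4 to 5 stars as positive, 3 as neutral, and 1 to 2 as negative. These thresholds are an assumption, not something the dataset defines, and `rating_to_sentiment` and `agreement` are hypothetical helpers.

```python
def rating_to_sentiment(rating):
    """Map a 1-5 star rating to a coarse sentiment label (assumed thresholds)."""
    if rating >= 4:
        return "Positive"
    if rating <= 2:
        return "Negative"
    return "Neutral"


def agreement(ratings, predicted):
    """Fraction of predictions that match the rating-derived label."""
    matches = sum(rating_to_sentiment(r) == p for r, p in zip(ratings, predicted))
    return matches / len(predicted)


ratings = [5, 5, 4, 5, 5]       # ratings of the first five rows shown earlier
predicted = ["Positive"] * 5    # the five predictions listed above
print(agreement(ratings, predicted))
```

Because the rating is also part of the prompt, high agreement is expected and says little about how well the review text alone is classified.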

In [34]:
# creating a new dataframe with the sentiments
# (sentiments were computed for the first five rows only, so copy just those rows)
output_df = test_df.head().copy()
output_df['sentiment'] = sentiment

In [36]:
# saving the output to a csv file in the working directory
output_df.to_csv("review_sentiments.csv", index=False)
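
The problem statement asks for the results in a text file. A minimal sketch of that step follows, using the standard-library `csv` module to write tab-separated lines. `write_results` is a hypothetical helper, and the output location (the system temp directory) is chosen only so the example runs anywhere.

```python
import csv
import os
import tempfile


def write_results(rows, output_file):
    """Write (rating, review, sentiment) rows as tab-separated text."""
    with open(output_file, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["rating", "review", "sentiment"])
        writer.writerows(rows)


# First two reviews from the dataset with the sentiments obtained above.
rows = [(5, "Love my Echo!", "Positive"), (5, "Loved it!", "Positive")]
output_file = os.path.join(tempfile.gettempdir(), "review_sentiments.txt")
write_results(rows, output_file)

# Read the file back to confirm what was written.
with open(output_file, encoding="utf-8") as handle:
    print(handle.read())
```

Using `csv.writer` rather than joining strings by hand means reviews that contain tabs or quotes are escaped correctly.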