<a href="https://colab.research.google.com/github/imostafizur/Feedback_Analysis/blob/master/Feedback_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Requirements

In [None]:
!git clone https://github.com/imostafizur/Feedback_Analysis.git

Cloning into 'Feedback_Analysis'...
remote: Enumerating objects: 88, done.
remote: Counting objects: 100% (88/88), done.
remote: Compressing objects: 100% (60/60), done.
remote: Total 88 (delta 54), reused 46 (delta 26), pack-reused 0
Receiving objects: 100% (88/88), 17.45 MiB | 11.59 MiB/s, done.
Resolving deltas: 100% (54/54), done.
Updating files: 100% (16/16), done.


In [None]:
!pip install transformers



# Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from scipy.special import softmax

warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline

# Data processing

### Loading Dataset

In [None]:
df = pd.read_csv('/content/Feedback_Analysis/Dataset/updated_merged_dataset.csv')
df.head()



Unnamed: 0,product_id,product_brand,product_model_name,name,rating,verified,title,body,helpfulVotes
0,B0000SX2UC,,Dual-Band / Tri-Mode Sprint PCS Phone w/ Voice...,,,,,,
1,B0009N5L7K,Motorola,Motorola I265 phone,,,,,,
2,B000SKTZ0S,Motorola,MOTOROLA C168i AT&T CINGULAR PREPAID GOPHONE C...,,,,,,
3,B001AO4OUC,Motorola,Motorola i335 Cell Phone Boost Mobile,,,,,,
4,B001DCJAJG,Motorola,Motorola V365 no contract cellular phone AT&T,,,,,,


### List all the columns

In [None]:
list_of_columns = df.columns.tolist()
print(list_of_columns)


['product_id', 'product_brand', 'product_model_name', 'name', 'rating', 'verified', 'title', 'body', 'helpfulVotes']


### Drop the unused columns: 'product_id', 'product_model_name', 'name', 'rating', 'verified', 'title', 'helpfulVotes'

In [None]:
df.drop(['product_id', 'product_model_name', 'name', 'rating', 'verified', 'title', 'helpfulVotes'], axis=1, inplace=True)
df.head()


Unnamed: 0,product_brand,body
0,,
1,Motorola,
2,Motorola,
3,Motorola,
4,Motorola,


### Count the number of NaN values in product_brand and body

In [None]:
nan_product_brand = df['product_brand'].isna().sum()
nan_body = df['body'].isna().sum()

print(f"Number of NaN in product_brand: {nan_product_brand}")
print(f"Number of NaN in body: {nan_body}")


Number of NaN in product_brand: 68910
Number of NaN in body: 1492


In [None]:

df.dropna(subset=['product_brand', 'body'], inplace=True)

nan_product_brand = df['product_brand'].isna().sum()
nan_body = df['body'].isna().sum()

print(f"Number of NaN in product_brand: {nan_product_brand}")
print(f"Number of NaN in body: {nan_body}")


Number of NaN in product_brand: 0
Number of NaN in body: 0


### Save the cleaned dataset

In [None]:
df.to_csv('cleaned_dataset.csv')

### Load the cleaned dataset

In [None]:
df = pd.read_csv('/content/cleaned_dataset.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,product_brand,body
0,1454,Motorola,DON'T BUY OUT OF SERVICE
1,1455,Motorola,I have been with nextel for nearly a year now ...
2,1456,Motorola,"I just got it and have to say its easy to use,..."
3,1457,Motorola,1 star because the phones locked so I have to ...
4,1458,Motorola,The product has been very good. I had used thi...


### Drop Unnamed: 0

In [None]:
# 'Unnamed: 0' holds the index written by the earlier to_csv call
df.drop(columns=['Unnamed: 0'], inplace=True)
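
The `Unnamed: 0` column dropped above comes from saving the DataFrame with its index and reading it back. The sketch below reproduces that round trip on a small made-up frame (the two rows are illustrative, not taken from the dataset) and shows that passing `index=False` to `to_csv` avoids the extra column.

```python
import io

import pandas as pd

# Toy frame standing in for the cleaned dataset (illustrative rows only).
df_demo = pd.DataFrame({'product_brand': ['Motorola', 'Nokia'],
                        'body': ['ok', 'great']})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

With `index=False` at save time, the separate drop step would not be needed.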


### Show the number of columns

In [None]:
num_columns = len(df.columns)
print(f"Number of columns: {num_columns}")

Number of columns: 2


### Dataset size

In [None]:
num_rows = len(df)
print(f"Number of rows: {num_rows}")

Number of rows: 67760


### Reduce the dataset to the first 500 rows

In [None]:
df = df.head(500)

In [None]:
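# index=False keeps the row index out of the file, so no 'Unnamed: 0' column appears on reload
df.to_csv('cleaned_dataset.csv', index=False)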

In [None]:
num_rows = len(df)
print(f"Number of rows: {num_rows}")

Number of rows: 500
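
`df.head(500)` keeps the first 500 rows in file order, so brands that appear early in the file dominate the subset. A random sample is one possible alternative. The sketch below uses a toy frame whose brand blocks are an assumption made for illustration, and the `random_state` value is arbitrary.

```python
import pandas as pd

# Toy frame: brands stored in contiguous blocks (an assumption for illustration).
toy = pd.DataFrame({
    'product_brand': ['Motorola'] * 90 + ['Nokia'] * 5 + ['Samsung'] * 105,
    'body': ['review text'] * 200,
})

# head() only sees the first block.
head_brands = set(toy.head(50)['product_brand'])

# sample() draws rows from the whole frame; random_state makes the draw repeatable.
sampled = toy.sample(n=50, random_state=0)
sampled_brands = set(sampled['product_brand'])

print(head_brands)
print(sorted(sampled_brands))
```

Sampling changes which reviews are analysed, so results would differ from the ones reported in this notebook.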


# Sentiment Analysis Using Hugging Face

In [None]:
file_path = "cleaned_dataset.csv"

In [None]:
def main():
    try:
        data = pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Error: File not found at '{file_path}'")
        return
    except Exception as e:
        print(f"Error reading CSV file: {e}")
        return

    # Load pre-trained model and tokenizer
    MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)

    # Get unique product brands
    unique_brands = data['product_brand'].unique()

    results = []
    for brand in unique_brands:
        brand_comments = data[data['product_brand'] == brand]['body']
        for comment in brand_comments:
            # Preprocess text once (tokenize, convert to tensors)
            encoded_input = tokenizer(comment, return_tensors='pt', truncation=True, max_length=512)
            while True:
                try:
                    # Perform sentiment analysis
                    output = model(**encoded_input)
                    break
                except RuntimeError:  # retry with the input shortened by one token
                    if encoded_input['input_ids'].shape[1] <= 1:
                        raise
                    # Trim every tensor so input_ids and attention_mask stay aligned
                    encoded_input = {k: v[:, :-1] for k, v in encoded_input.items()}

            scores = output[0][0].detach().numpy()
            scores = softmax(scores)

            # Get sentiment label and score
            scores_dict = {
                'negative': scores[0],
                'neutral': scores[1],
                'positive': scores[2]
            }
            label = max(scores_dict, key=scores_dict.get)
            score = scores_dict[label]
            results.append({'product_brand': brand, 'comment': comment, 'label': label, 'score': score})

    # Create DataFrame of Results
    results_df = pd.DataFrame(results)

    # Summarize Sentiment by Brand
    sentiment_summary = results_df.groupby('product_brand')['label'].value_counts().unstack(fill_value=0)

    # Calculate Proportions
    sentiment_summary_pct = sentiment_summary.div(sentiment_summary.sum(axis=1), axis=0) * 100

    print("\nSentiment Analysis Summary by Brand:")
    print(sentiment_summary)
    print("\nSentiment Proportions by Brand (%):")
    print(sentiment_summary_pct.round(2))

    # Optional: Save the results
    results_df.to_csv("sentiment_analysis_results.csv", index=False)

if __name__ == "__main__":
    main()


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Sentiment Analysis Summary by Brand:
label          negative  neutral  positive
product_brand                             
Motorola            165       47       238
Nokia                 0        1         2
Samsung              12        8        27

Sentiment Proportions by Brand (%):
label          negative  neutral  positive
product_brand                             
Motorola          36.67    10.44     52.89
Nokia              0.00    33.33     66.67
Samsung           25.53    17.02     57.45
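
As a sanity check on the two tables above, the percentages can be recomputed by hand from the counts. This is a plain-Python sketch of the same row-wise normalisation that `sentiment_summary.div(...) * 100` performs; the counts are copied from the summary output, and the helper name `row_percentages` is made up for this example.

```python
# Counts copied from the summary table: (negative, neutral, positive).
counts = {
    "Motorola": (165, 47, 238),
    "Nokia": (0, 1, 2),
    "Samsung": (12, 8, 27),
}


def row_percentages(row):
    """Return each count as a percentage of the row total, rounded to 2 places."""
    total = sum(row)
    return tuple(round(100 * c / total, 2) for c in row)


percentages = {brand: row_percentages(row) for brand, row in counts.items()}
for brand, pct in percentages.items():
    print(brand, sum(counts[brand]), pct)
```

The row totals add up to the 500 reviews kept earlier. Nokia contributes only 3 of them, so its percentages say very little.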


### Negative Reviews Mentioning the Camera

In [19]:
def main():
    try:
        data = pd.read_csv(file_path)
    except FileNotFoundError:
        print(f"Error: File not found at '{file_path}'")
        return
    except Exception as e:
        print(f"Error reading CSV file: {e}")
        return

    # Load pre-trained model and tokenizer
    MODEL = "cardiffnlp/twitter-roberta-base-sentiment-latest"
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL)

    # Get unique product brands
    unique_brands = data['product_brand'].unique()

    results = []
    for brand in unique_brands:
        brand_comments = data[data['product_brand'] == brand]['body']
        for comment in brand_comments:
            # Preprocess text once (tokenize, convert to tensors)
            encoded_input = tokenizer(comment, return_tensors='pt', truncation=True, max_length=512)
            while True:
                try:
                    # Perform sentiment analysis
                    output = model(**encoded_input)
                    break
                except RuntimeError:  # retry with the input shortened by one token
                    if encoded_input['input_ids'].shape[1] <= 1:
                        raise
                    # Trim every tensor so input_ids and attention_mask stay aligned
                    encoded_input = {k: v[:, :-1] for k, v in encoded_input.items()}

            scores = output[0][0].detach().numpy()
            scores = softmax(scores)

            # Get sentiment label and score
            scores_dict = {
                'negative': scores[0],
                'neutral': scores[1],
                'positive': scores[2]
            }
            label = max(scores_dict, key=scores_dict.get)
            score = scores_dict[label]

            # Append results if the review is negative
            if label == 'negative':
                results.append({'product_brand': brand, 'comment': comment, 'label': label, 'score': score})

    # Create DataFrame of Negative Reviews
    negative_reviews_df = pd.DataFrame(results)

    # Filter for comments mentioning phone cameras
    camera_keywords = ['camera', 'photo', 'picture', 'image', 'lens', 'selfie']
    camera_reviews_df = negative_reviews_df[negative_reviews_df['comment'].str.contains('|'.join(camera_keywords), case=False)]

    # Summarize Sentiment by Brand for Camera-Related Negative Reviews
    sentiment_summary = camera_reviews_df.groupby('product_brand')['label'].value_counts().unstack(fill_value=0)

    # Calculate Proportions
    sentiment_summary_pct = sentiment_summary.div(sentiment_summary.sum(axis=1), axis=0) * 100

    print("\nSentiment Analysis Summary by Brand (Negative Reviews on Camera):")
    print(sentiment_summary)
    print("\nSentiment Proportions by Brand (Negative Reviews on Camera) (%):")
    print(sentiment_summary_pct.round(2))

    # Optional: Save the results
    camera_reviews_df.to_csv("camera_negative_reviews.csv", index=False)

if __name__ == "__main__":
    main()


Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Sentiment Analysis Summary by Brand (Negative Reviews on Camera):
label          negative
product_brand          
Motorola              7
Samsung               1

Sentiment Proportions by Brand (Negative Reviews on Camera) (%):
label          negative
product_brand          
Motorola          100.0
Samsung           100.0
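
The proportions above are 100% by construction, because only negative reviews were kept before grouping. A more informative figure may be the share of each brand's negative reviews that mention the camera. The sketch below computes it from the counts printed in the two outputs, assuming both runs labelled the reviews identically; `camera_share` is a hypothetical helper name.

```python
# Counts copied from the two outputs above.
negative_total = {"Motorola": 165, "Nokia": 0, "Samsung": 12}
negative_camera = {"Motorola": 7, "Nokia": 0, "Samsung": 1}


def camera_share(brand):
    """Percentage of a brand's negative reviews that mention the camera, or None."""
    total = negative_total[brand]
    if total == 0:
        return None  # no negative reviews, so the share is undefined
    return round(100 * negative_camera[brand] / total, 2)


shares = {brand: camera_share(brand) for brand in negative_total}
print(shares)
```

With one camera-related negative review out of 12, the Samsung figure rests on a very small sample.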


# Downloading the Hugging Face Model (for Use on a Local Machine)

In [20]:
!git lfs install

Git LFS initialized.


In [22]:
!git clone https://huggingface.co/cardiffnlp/twitter-roberta-base-sentiment-latest

Cloning into 'twitter-roberta-base-sentiment-latest'...
remote: Enumerating objects: 49, done.
remote: Counting objects: 100% (49/49), done.
remote: Compressing objects: 100% (24/24), done.
remote: Total 49 (delta 24), reused 49 (delta 24), pack-reused 0 (from 0)
Unpacking objects: 100% (49/49), 541.17 KiB | 2.60 MiB/s, done.
Filtering content: 100% (2/2), 953.57 MiB | 29.80 MiB/s, done.


In [23]:
!ls

camera_negative_reviews.csv  Feedback_Analysis	sentiment_analysis_results.csv
cleaned_dataset.csv	     sample_data	twitter-roberta-base-sentiment-latest


In [27]:
!sudo apt install tree

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following NEW packages will be installed:
  tree
0 upgraded, 1 newly installed, 0 to remove and 45 not upgraded.
Need to get 47.9 kB of archives.
After this operation, 116 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tree amd64 2.0.2-1 [47.9 kB]
Fetched 47.9 kB in 0s (224 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 
Selecting previously unselected package tree.
(Reading database ... 121918 files and directories currently install

In [28]:
!cd twitter-roberta-base-sentiment-latest && tree -L 2

.
├── config.json
├── merges.txt
├── pytorch_model.bin
├── README.md
├── special_tokens_map.json
├── tf_model.h5
└── vocab.json

0 directories, 7 files


In [30]:
import tensorflow as tf
from transformers import RobertaTokenizer, TFRobertaForSequenceClassification

# Path to the model and tokenizer files
model_dir = 'twitter-roberta-base-sentiment-latest'

# Load the tokenizer
tokenizer = RobertaTokenizer.from_pretrained(model_dir)

# Load the model
model = TFRobertaForSequenceClassification.from_pretrained(model_dir, from_pt=True)

# Function to perform sentiment analysis
def predict_sentiment(texts):
    # Tokenize the input text
    inputs = tokenizer(texts, return_tensors='tf', padding=True, truncation=True)

    # Get model predictions
    outputs = model(inputs)
    predictions = tf.nn.softmax(outputs.logits, axis=-1)

    # Get the predicted class
    predicted_classes = tf.argmax(predictions, axis=1)

    return predicted_classes

# Example usage
texts = [
    "I love this product!",
    "This is the worst thing I've ever bought."
]

predicted_classes = predict_sentiment(texts)

# Convert predicted classes to sentiment labels
labels = ['Negative', 'Neutral', 'Positive']  # Adjust based on the number of classes
predicted_labels = [labels[int(pred)] for pred in predicted_classes]

for text, label in zip(texts, predicted_labels):
    print(f"Text: {text} -> Sentiment: {label}")


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFRobertaForSequenceClassification: ['roberta.embeddings.position_ids']
- This IS expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFRobertaForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFRobertaForSequenceClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFRobertaForSequenceClassification for predictions without further training.
Asking to truncate to max_length but no maximum length is provided and the model 

Text: I love this product! -> Sentiment: Positive
Text: This is the worst thing I've ever bought. -> Sentiment: Negative
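
`predict_sentiment` returns only the class index, so the confidence behind each prediction is lost. The dependency-free sketch below shows the softmax and argmax step on its own and returns the label together with its probability. The logits are made-up values, not model output, and the label order is the one assumed in the cell above (the model's `config.id2label` is the place to confirm it).

```python
import math

LABELS = ['Negative', 'Neutral', 'Positive']  # order assumed, as in the cell above


def softmax(logits):
    """Convert raw scores to probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs[best]


# Hypothetical logits for one review (illustrative only).
label, confidence = label_with_confidence([-1.2, 0.3, 2.5])
print(label, round(confidence, 3))
```

Keeping the probability makes it possible to treat low-confidence predictions differently, for example by reviewing them manually.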
