<a href="https://colab.research.google.com/github/leovliz02/Data-science-course-4/blob/main/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Installing and Importing Libraries

In [1]:
import pandas as pd
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

Downloading NLTK Data

In [2]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

Loading CSV File

In [3]:
from google.colab import files

uploaded = files.upload()

Saving review.csv to review.csv


In [4]:
df = pd.read_csv("review.csv")

**Defining Function for Simplifying Data:**
*   Lemmatisation
*   Removal of Stop Words
*   Changing Data to Lower Case
*   Tokenisation of Text

In [5]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [6]:
def preprocess_text(text):
    text = str(text)
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = word_tokenize(text)
    cleaned = [lemmatizer.lemmatize(tok) for tok in tokens
               if tok not in stop_words and len(tok) > 1]
    return " ".join(cleaned)

Applying Function to Review Column

In [7]:
df['cleaned_review'] = df['review'].apply(preprocess_text)

Sample of Cleaned Text

In [8]:
print(df[['review','cleaned_review']].head())

                                              review  \
0  I bought this hair oil after viewing so many g...   
1  Used This Mama Earth Newly Launched Onion Oil ...   
2  So bad product...My hair falling increase too ...   
3  Product just smells similar to navarathna hair...   
4  I have been trying different onion oil for my ...   

                                      cleaned_review  
0  bought hair oil viewing many good comment prod...  
1  used mama earth newly launched onion oil twice...  
2  bad product hair falling increase much order s...  
3  product smell similar navarathna hair oil stro...  
4  trying different onion oil hair hair healthy p...  


Identifying the Top 5 Reviewed Products

In [9]:
# Count number of reviews per product name
review_counts = df.groupby('name').size().reset_index(name='num_reviews')

# Sort descending
review_counts = review_counts.sort_values(by='num_reviews', ascending=False)

# Show the top product
print (review_counts.head(5))


                                              name  num_reviews
109                             Tata-Tea-Gold-500g           60
25          Dettol-Disinfectant-Cleaner-Home-Fresh           40
18                      Cinthol-Lime-Soap-100-Pack           40
45   Godrej-Security-Solutions-SEEC9060-Electronic           40
54           Himalaya-Moisturizing-Aloe-Vera-200ml           40


Filtering by Tata-Tea-Gold to create subtable tata_df

In [17]:
tata_df = df[df["name"].str.contains("Tata-Tea-Gold-500g", na=False)]

# **Sentiment Analysis With VADER**

Because VADER cannot automatically identify aspects of a given product, aspects that must be analysed must be hardcoded into the model as follows.

As identified earlier, the product with the most reviews is "Tata-Tea-Gold", and I will thus be using this as a sample. Common aspects of this are:

In [18]:
aspects = {
    "taste": ["taste", "flavour", "flavor", "aftertaste"],
    "aroma": ["aroma", "smell", "fragrance"],
    "color": ["color", "colour"],
    "strength": ["strong", "strength", "light", "weak"],
    "price": ["price", "cost", "value", "expensive", "cheap"],
    "quality": ["quality", "fresh", "stale"],
    "packaging": ["pack", "packet", "packaging", "box"],
    "delivery": ["delivery", "shipping", "late", "fast"],
    "quantity": ["quantity", "weight", "amount"]
}



Sentiment Analysis of Tata Tea Gold Reviews with VADER

In [19]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import sent_tokenize

nltk.download("punkt")
nltk.download("vader_lexicon")

sia = SentimentIntensityAnalyzer()

def aspect_sentiment_vader(review, aspects_dict):
    sentences = sent_tokenize(str(review))
    aspect_scores = {aspect: [] for aspect in aspects_dict}

    for sent in sentences:
        sent_lower = sent.lower()
        for aspect, keywords in aspects_dict.items():
            if any(word in sent_lower for word in keywords):
                score = sia.polarity_scores(sent)["compound"]
                aspect_scores[aspect].append(score)

    # Average score per aspect
    final_scores = {}
    for aspect, scores in aspect_scores.items():
        if scores:
            final_scores[aspect] = sum(scores) / len(scores)
        else:
            final_scores[aspect] = None

    return final_scores


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Applying to Tata Tea Gold

In [21]:
aspect_results = tata_df["review"].apply(
    lambda x: aspect_sentiment_vader(x, aspects)
)

aspect_df = pd.DataFrame(aspect_results.tolist())

tata_df = pd.concat([tata_df.reset_index(drop=True), aspect_df], axis=1)

overall_aspect_sentiment = aspect_df.mean(numeric_only=True).sort_values(ascending=False)
print(overall_aspect_sentiment)


aroma        0.501829
color        0.492275
strength     0.457997
delivery     0.440938
taste        0.417469
price        0.352680
packaging    0.156890
quantity     0.000000
quality     -0.095425
dtype: float64


# **Sentiment Analysis with SpaCy**

With spacy as well, I will be analysing the sentiments for the product Tata-Tea-Gold-1kg

Setting up Spacy Library

In [22]:
import spacy
import pandas as pd

nlp = spacy.load("en_core_web_sm")

Extracting Aspect-Opinion Pairs

In [25]:
import spacy

nlp = spacy.load("en_core_web_sm")

'''
BAD_ASPECTS = {
    "thing", "stuff", "item", "product",
    "years", "year", "time", "addition",
    "purpose", "sense", "one", "lot", "way",
    "people", "person", "price",
    "brand"
}

BAD_OPINIONS = {
    "many", "much", "some", "any", "few",
    "standard", "available", "current",
    "printed", "tried", "serving", "makes",
    "added", "used", "made", "give", "take",
    "keep", "get", "go", "come", "put"
}
'''

COPULAR_VERBS = {"be", "seem", "feel", "look", "appear"}

def extract_aspect_opinion_pairs(text):
    doc = nlp(text)
    pairs = []

    for sent in doc.sents:
        for token in sent:

            # -------- Case 1: Adjective directly modifying a noun --------
            # e.g. "strong aroma", "good taste"
            if token.dep_ == "amod" and token.head.pos_ == "NOUN":
                aspect = token.head.lemma_.lower()
                opinion = token.lemma_.lower()

                if aspect in BAD_ASPECTS or opinion in BAD_OPINIONS:
                    continue

                # ignore quantity / determiner adjectives
                if token.tag_ in {"JJR", "JJS"} or token.text.lower() in {"many", "few", "much"}:
                    continue

                pairs.append((aspect, opinion, sent.text))


            # -------- Case 2: Adjective complement --------
            # e.g. "taste is good"
            if token.dep_ == "acomp" and token.head.pos_ == "VERB":
                if token.head.lemma_ in COPULAR_VERBS:
                    subj = [w for w in token.head.lefts if w.dep_ in ("nsubj", "nsubjpass")]
                    if subj:
                        aspect = subj[0].lemma_.lower()
                        opinion = token.lemma_.lower()

                        if aspect in BAD_ASPECTS or opinion in BAD_OPINIONS:
                            continue

                        pairs.append((aspect, opinion, sent.text))


            # -------- Case 3: Opinion verb + direct object --------
            # e.g. "I love the taste"
            if token.pos_ == "VERB":
                dobj = [w for w in token.rights if w.dep_ == "dobj"]
                if dobj:
                    aspect = dobj[0].lemma_.lower()
                    opinion = token.lemma_.lower()

                    if aspect in BAD_ASPECTS or opinion in BAD_OPINIONS:
                        continue

                    pairs.append((aspect, opinion, sent.text))

    return pairs


Applying SpaCy to Tata-Tea-Gold Reviews

In [27]:
records = []

for idx, row in tata_df.iterrows():
    text = str(row["review"])
    pairs = extract_aspect_opinion_pairs(text)

    for aspect, opinion, sentence in pairs:
        records.append({
            "review_id": idx,
            "aspect": aspect,
            "opinion": opinion,
            "sentence": sentence
        })

aspect_df = pd.DataFrame(records)
aspect_df.head(200)


Unnamed: 0,review_id,aspect,opinion,sentence
0,0,mahal,try,"I have tried Taj Mahal, Red label, RCM and man..."
1,0,other,different,"I have tried Taj Mahal, Red label, RCM and man..."
2,0,tea,like,I like my tea strong and for that you don't ha...
3,0,much,add,I like my tea strong and for that you don't ha...
4,0,tea,strong,"It is not just too strong tea, it tastes good."
...,...,...,...,...
195,36,tea,make,"For the purpose of making milk tea, this part..."
196,36,combination,strong,"For the purpose of making milk tea, this part..."
197,36,colour,good,"For the purpose of making milk tea, this part..."
198,36,fragrance,please,"For the purpose of making milk tea, this part..."
