# NLP: Product Comment Sentiment Analysis

**Resources**
*   https://www.geeksforgeeks.org/what-is-sentiment-analysis/ (intro)
*   https://www.nltk.org/api/nltk.tokenize.html (tokenize - separate sentence into words)
*   https://youtu.be/9p1KYtYAus8 (vader lexicon tutorial - for sentiment analysis)
*   dataset (#56 from https://www.nltk.org/nltk_data/)
*   dataset downloaded from https://www.kaggle.com/datasets/mdwaquarazam/headphone-dataset-review-analysis

<br>


<br>
Documentation (background study, methods, etc.) will be added later.

> Step 1: Get the data (CSV file).

> Step 2: Pre-process the comments: tokenize, lemmatize, and remove stop words and punctuation. Then run sentiment analysis.

> Step 3: Apply a count vectorizer (how often each word appears), then measure similarity (cosine similarity, correlation, etc.).

> Step 4: Each of us chooses a preferred method (Bayes, k-NN, k-means).

> Step 5: Compare the results from the different methods.
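
As a rough illustration of Step 3, the sketch below builds word-count vectors and compares them with cosine similarity using only the standard library. It is not scikit-learn's `CountVectorizer`; the helper names `count_vector` and `cosine_similarity` and the three toy token lists are assumptions made for the example.

```python
import math
from collections import Counter


def count_vector(tokens):
    """Map each token to the number of times it appears."""
    return Counter(tokens)


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors stored as Counters."""
    # Only tokens present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy token lists (hypothetical, not taken from the dataset).
doc1 = count_vector(["sound", "quality", "good"])
doc2 = count_vector(["sound", "quality", "bad"])
doc3 = count_vector(["fits", "well"])

print(cosine_similarity(doc1, doc2))  # two of three words shared
print(cosine_similarity(doc1, doc3))  # no words shared
```

Documents that share more words score closer to 1, and documents with no words in common score 0.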







# Getting Data & Setup


In [None]:
# install dependencies (%pip is a notebook magic, run inside the notebook)
%pip install nltk
%pip install pandas

import nltk
import site
import pandas as pd    # to read the csv file
import random
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize # for tokenization of words
from nltk.stem import WordNetLemmatizer # for lemmatization
import re # Python's regular expression module, for patterns that match punctuation marks

nltk.download('all') # download all NLTK data (includes the VADER lexicon, stop words, and WordNet)
site.getsitepackages() # list the site-packages directories of the current Python environment

# Read data

In [4]:
data = pd.read_csv('headphone_datn.csv')

In [5]:
# data.head()
# count the non-null values in each column
total = data.count()
print(total)

Customer_Name    1604
REVIEW_TITLE     1594
Color            1604
REVIEW_DATE      1604
COMMENTS         1546
RATINGS          1604
dtype: int64
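
The counts above show fewer non-null values in `COMMENTS` and `REVIEW_TITLE` than in the other columns, so some rows have missing text. One possible way to inspect and drop such rows is sketched below on a small hypothetical frame; the three sample rows are invented for illustration and are not from the dataset.

```python
import pandas as pd

# Hypothetical three-row frame standing in for the review data.
sample = pd.DataFrame({
    "COMMENTS": ["Great sound", None, "Mic is bad"],
    "RATINGS": [5, 4, 1],
})

# Number of missing values in each column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Keep only the rows that actually have a comment, then renumber the index.
cleaned = sample.dropna(subset=["COMMENTS"]).reset_index(drop=True)
print(len(cleaned))
```

Dropping on the frame with `subset=` removes the whole row, whereas calling `dropna` on a single selected column leaves the frame itself unchanged.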


# Pre-Processing

In [7]:
# create pre-processing functions

# replace missing comments with empty strings so every row can be tokenized
data['COMMENTS'] = data['COMMENTS'].fillna('').astype(str)

def preprocess_text(text):
  # tokenize the text
  tokens = word_tokenize(text)
  # remove punctuation and numbers by keeping alphabetic tokens only
  return [w for w in tokens if w.isalpha()]

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def normalize_tokens(tokens):
  # remove stop words (the NLTK stop-word list is lowercase)
  filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

  # lemmatize the tokens
  lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]

  # join the tokens back into a string
  processed_text = ' '.join(lemmatized_tokens)
  return processed_text

# only tokenization is applied here; normalize_tokens is defined for the later steps
data['tokenized'] = data['COMMENTS'].apply(preprocess_text)
print(data['tokenized'])


0       [Okay, I, was, skeptical, at, first, to, buy, ...
1       [The, earphone, is, worth, what, you, pay, for...
2       [Particularly, for, people, with, sensitive, e...
3       [Built, Quality, lower, wire, is, a, durable, ...
4       [Do, go, with, the, over, all, start, rating, ...
                              ...                        
1599    [Quite, good, sound, qualityAnd, had, impressi...
1600                                                [Osm]
1601    [Earphones, fits, well, onto, the, ears, does,...
1602    [Sound, quality, very, bad, Over, all, very, b...
1603    [This, is, only, for, calls, Mic, is, good, Bu...
Name: tokenized, Length: 1604, dtype: object


In [8]:
analyzer = SentimentIntensityAnalyzer()

# create get_sentiment function

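def get_sentiment(text):
    # polarity_scores returns 'neg', 'neu', 'pos', and 'compound' scores
    scores = analyzer.polarity_scores(text)
    # label a comment 1 (positive) if VADER finds any positive content, otherwise 0
    sentiment = 1 if scores['pos'] > 0 else 0

    return sentiment

# apply get_sentiment function to the raw comments

data['sentiment'] = data['COMMENTS'].apply(get_sentiment)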
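
The rule above marks a comment as positive whenever `pos` is above zero. A common alternative is to threshold VADER's `compound` score instead, using the cut-offs suggested in the VADER documentation (at least 0.05 for positive, at most -0.05 for negative, neutral in between). The sketch below shows that rule on its own; `label_from_compound` is a hypothetical helper, and the score dictionaries are hand-written to mimic the shape of `polarity_scores` output rather than real results.

```python
def label_from_compound(scores, threshold=0.05):
    """Turn a VADER-style score dictionary into a three-way label."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Hand-written dictionaries shaped like polarity_scores output (illustrative only).
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.1, "neu": 0.8, "pos": 0.1, "compound": 0.0},
]

for scores in examples:
    print(label_from_compound(scores))
```

With this rule, a comment that mixes a little praise with stronger criticism can come out negative or neutral, whereas the `pos > 0` rule would label it positive.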

