# Problem Statement
An e-commerce company wants to improve customer satisfaction by understanding the sentiment behind customer reviews. They aim to identify key factors influencing positive and negative sentiments to enhance their products and services accordingly.

# Project Scenario
Recently, I finished reading the life-changing book "Atomic Habits" by James Clear. As this book significantly altered my daily life, work habits, and ability to build positive routines, I became curious about others' perspectives. After searching online, I found the book on Amazon and noticed a mix of reviews from readers. Inspired by this, I devised a plan to scrape these reviews, analyze the sentiments expressed by customers, and identify the factors influencing both positive and negative feedback.

# Project Goal
- Analyze customer sentiment from e-commerce website (Amazon) reviews to understand factors driving feedback.
- Provide insights into customer preferences and concerns for data-driven decision-making.

## Deliverables
- Data Collection: Extract customer reviews and additional data from an e-commerce website (Amazon).
- Data Cleaning & Preprocessing: Handle missing values, duplicates, and irrelevant features; perform text preprocessing tasks.
- Sentiment Analysis: Analyze sentiment using NLP techniques.
- Feature Engineering: Extract features from text data and incorporate additional features.
- Machine Learning Modeling: Build models to predict sentiment.
- Data Visualization and Interpretation: Visualize sentiment analysis results and feature importance.
- Solution and Recommendations: Summarize findings and provide actionable recommendations.

# Stakeholders: Why Customer Sentiment Analysis?

1. **Retailers**: Whether it's on Amazon, Flipkart, or Shopify, retailers can benefit greatly from understanding customer sentiments. By learning from this project, they can refine their product offerings and marketing strategies to create happier customers and drive more sales.

2. **Authors & Publishers**: Understanding how readers feel about their work is vital for authors and publishers. By uncovering insights from customer sentiments, they can improve future books, connect better with their audience, and boost sales and loyalty.

3. **Data Enthusiasts**: Scraping Amazon reviews was no easy feat, and I realized there's a lack of accessible tutorials out there. So, I'm creating a comprehensive guide to help others navigate through similar challenges and empower fellow data enthusiasts in their projects.

# Part One: Data Scraping


# Step One: Data Collection
### Tasks
- Extract customer reviews and additional data from Amazon.

**What are the available ways to scrape data?**

- **BeautifulSoup**: The library I used in this project. It is one of the most popular Python packages for parsing HTML and XML data.
- Scrapy: A powerful and flexible Python framework for web scraping.
- Selenium: A popular browser automation tool that is particularly useful for scraping websites that use JavaScript to generate content dynamically.
- Puppeteer (for JavaScript): Provides a high-level API for controlling headless Chrome or Chromium browsers. It can be used for tasks such as web scraping, automated testing, and generating screenshots of web pages.
- APIs: Some websites offer APIs (Application Programming Interfaces) that allow developers to access data in a structured format without needing to scrape HTML.
- Commercial scraping tools: Tools such as Mozenda, ParseHub, and Content Grabber offer features like point-and-click interfaces, scheduling, and data export options.
- Autoparser: A tool for automating web scraping.
- Octoparse: An easy-to-use tool that requires no coding.
- AI tools: If you're using ChatGPT 4, you can upload the saved Amazon HTML file with a prompt and let it extract the reviews for you.

Now, it's your choice to get your hands dirty with some coding or let tools work for you.... :) 
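To give a feel for the BeautifulSoup route, here is a minimal sketch that parses review text out of a saved HTML snippet. The markup, the `data-hook="review-body"` attribute, and the `extract_reviews` helper are all assumptions made for illustration; the real page structure may differ and can change at any time.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a saved review page.
# The data-hook value is an assumption, not a documented interface.
SAMPLE_HTML = """
<div id="reviews">
  <div class="review"><span data-hook="review-body"> Great book, very practical. </span></div>
  <div class="review"><span data-hook="review-body">Too repetitive for my taste.</span></div>
</div>
"""


def extract_reviews(html: str) -> list[str]:
    """Return the stripped text of every review-body span in the page."""
    soup = BeautifulSoup(html, "html.parser")
    spans = soup.find_all("span", attrs={"data-hook": "review-body"})
    return [span.get_text(strip=True) for span in spans]


reviews = extract_reviews(SAMPLE_HTML)
print(reviews)
```

Working from a saved HTML file keeps the parsing logic separate from the fetching logic, which makes it easier to debug selectors without sending repeated requests.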

### Imports
The scraping itself relied on BeautifulSoup for parsing HTML documents and requests for making HTTP requests to fetch web pages. In this notebook the scraped reviews are loaded from an Excel file, so only numpy and pandas are imported here for data manipulation.

In [1]:
# Imports
import numpy as np
import pandas as pd

In [2]:
df = pd.read_excel("C:/Users/haide/Downloads/CustomerReviews.xlsx")

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,Reviews
0,0,I previously wrote this review right after rea...
1,1,"""Atomic Habits"" by James Clear is a transforma..."
2,2,I've read a lot of books on changing behavior ...
3,3,Very insightful and easy to read. It’s crazy h...
4,4,"""Atomic Habits"" by James Clear is pretty much ..."


In [4]:
df.shape

(4610, 2)

In [5]:
# drop rows with missing values
df.dropna(inplace=True)
df.shape

(4610, 2)
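The shape is unchanged, so there were no missing values to drop. The deliverables also mention duplicates and irrelevant features, so here is a minimal sketch of how those could be handled in the same step. The toy data and the `clean_reviews` helper are made up for illustration; only the column names come from the notebook.

```python
import pandas as pd

# Toy frame standing in for the scraped reviews (values are made up).
raw = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3, 4],
    "Reviews": ["Great book", None, "Great book", "   ", "Changed my habits"],
})


def clean_reviews(frame: pd.DataFrame) -> pd.DataFrame:
    """Drop the leftover index column, empty reviews, and exact duplicates."""
    out = frame.drop(columns=["Unnamed: 0"], errors="ignore")
    out = out.dropna(subset=["Reviews"])            # missing review text
    out = out[out["Reviews"].str.strip() != ""]     # whitespace-only reviews
    out = out.drop_duplicates(subset=["Reviews"])   # repeated review text
    return out.reset_index(drop=True)


cleaned = clean_reviews(raw)
print(cleaned.shape)
```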

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4610 entries, 0 to 4609
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  4610 non-null   int64 
 1   Reviews     4610 non-null   object
dtypes: int64(1), object(1)
memory usage: 72.2+ KB


# Part Two: Sentiment Analysis with NLP

### Imports
I imported **NLTK** (Natural Language Toolkit) to handle the natural language processing tasks in this part. From it, I used **stopwords** to filter out common words that contribute little to sentiment, so the focus stays on more meaningful words, and **word_tokenize** to split each review into individual words, which makes later steps such as stopword removal and stemming possible. I also imported **PorterStemmer** to reduce words to their stems and **WordNetLemmatizer** to reduce them to dictionary base forms, both of which help normalize the text. Last but not least, I imported Python's built-in **string** module to remove punctuation, giving a cleaner dataset for **sentiment analysis**.

In [7]:
# Imports
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

import string

The next step is to download the required NLTK resources: **punkt** for tokenizing text, **stopwords** for filtering out common words, **wordnet** for lemmatization, and **omw-1.4** for WordNet's multilingual data.

In [8]:
# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\haide\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\haide\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\haide\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\haide\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

The dataset contains 4,610 customer reviews of the book 'Atomic Habits,' with two columns: an index and the review text. Let's explore the data in the reviews column. To do this, I **tokenized the text by converting it to lowercase and splitting it into individual words.**

In [9]:
# Tokenization
df['text_tokens'] = df['Reviews'].apply(lambda x: word_tokenize(x.lower()))
df.head()

Unnamed: 0.1,Unnamed: 0,Reviews,text_tokens
0,0,I previously wrote this review right after rea...,"[i, previously, wrote, this, review, right, af..."
1,1,"""Atomic Habits"" by James Clear is a transforma...","[``, atomic, habits, '', by, james, clear, is,..."
2,2,I've read a lot of books on changing behavior ...,"[i, 've, read, a, lot, of, books, on, changing..."
3,3,Very insightful and easy to read. It’s crazy h...,"[very, insightful, and, easy, to, read, ., it,..."
4,4,"""Atomic Habits"" by James Clear is pretty much ...","[``, atomic, habits, '', by, james, clear, is,..."


In [10]:
df['text_tokens'].head()

0    [i, previously, wrote, this, review, right, af...
1    [``, atomic, habits, '', by, james, clear, is,...
2    [i, 've, read, a, lot, of, books, on, changing...
3    [very, insightful, and, easy, to, read, ., it,...
4    [``, atomic, habits, '', by, james, clear, is,...
Name: text_tokens, dtype: object

In [11]:
# Remove stopwords and punctuation
stop_words = set(stopwords.words('english'))
punctuation = set(string.punctuation)
filtered_tokens = []

for tokens in df['text_tokens']:
    filtered_tokens.append([word for word in tokens if word not in stop_words and word not in punctuation])

In this step, I built a set of **English stopwords** from NLTK's stopwords corpus and a set of **punctuation marks** from Python's string module. These sets are used to filter **irrelevant words** and **symbols** out of the token lists stored in the **text_tokens** column. I then iterated through each token list, kept only the tokens that are neither stopwords nor punctuation, and stored the results in the **filtered_tokens** list.
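One caveat: a membership test against `string.punctuation` only removes tokens that are exactly one ASCII punctuation character. That is why tokens such as the two-character quote marks and the curly apostrophe still show up in **cleaned_text** later in the notebook. The sketch below contrasts that test with a slightly stricter filter. The tiny stopword set and the `filter_tokens` helper are assumptions for illustration; the notebook uses NLTK's full English list.

```python
import string

# Small illustrative stopword list; the notebook uses NLTK's full English list.
STOP_WORDS = {"i", "a", "by", "is", "of", "on", "this"}
PUNCTUATION = set(string.punctuation)


def filter_tokens(tokens):
    """Drop stopwords and any token that contains no letter or digit."""
    return [
        tok for tok in tokens
        if tok not in STOP_WORDS and any(ch.isalnum() for ch in tok)
    ]


tokens = ["``", "atomic", "habits", "''", "by", "james", "clear", "is", "'ve", "’", "."]

# Membership test as in the notebook: only exact single-character matches go.
membership_only = [t for t in tokens if t not in STOP_WORDS and t not in PUNCTUATION]

print(membership_only)
print(filter_tokens(tokens))
```

The stricter version still keeps contraction fragments such as `'ve`, because they contain letters; whether to drop those as well is a judgment call.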

In [12]:
# Stemming
stemmer = PorterStemmer()
stemmed_tokens = []
for tokens in filtered_tokens:
    stemmed_tokens.append([stemmer.stem(word) for word in tokens])

Here, I applied **stemming**, a text normalization technique that reduces each word in the filtered tokens to its stem using the **Porter Stemmer algorithm provided by NLTK**. Stemming shrinks the vocabulary by mapping inflected forms of a word to one common stem, although the stems it produces are not always dictionary words. The resulting **stemmed_tokens** list contains the stemmed version of each review. Note that this list is kept only for comparison: the cleaned text used in the rest of the notebook is built from the lemmatized tokens in the next step.
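To see what the Porter stemmer actually does to individual words, here is a small sketch. The word list is chosen for illustration and is not taken from the dataset.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# A few illustrative words (not taken from the dataset).
words = ["habits", "changing", "reading", "books"]

# Map each word to its Porter stem.
stems = {word: stemmer.stem(word) for word in words}
print(stems)
```

Stems such as the one produced for "changing" are truncated forms rather than real words, which is one reason to prefer lemmatized tokens when the cleaned text is passed to a lexicon-based sentiment tool.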

In [13]:
# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = []
for tokens in filtered_tokens:
    lemmatized_tokens.append([lemmatizer.lemmatize(word) for word in tokens])

Next, I applied **lemmatization**, another text normalization technique, using the **WordNet Lemmatizer provided by NLTK**. Lemmatization maps words to their dictionary base forms, which helps in **standardizing the vocabulary** while keeping real words. Because `lemmatize` is called here without a part-of-speech argument, every token is treated as a noun: plurals are reduced ("habits" becomes "habit"), while verb forms such as "reading" and "changing" are left unchanged, as the output further down shows. The resulting **lemmatized_tokens** list is what the rest of the notebook uses.

In [14]:
# Join tokens back to text
df['cleaned_text'] = [' '.join(tokens) for tokens in lemmatized_tokens]

I used a list comprehension to join each list of lemmatized tokens back into a single space-separated string and assigned the result to a new column named **cleaned_text**.

In [15]:
df.head()

Unnamed: 0.1,Unnamed: 0,Reviews,text_tokens,cleaned_text
0,0,I previously wrote this review right after rea...,"[i, previously, wrote, this, review, right, af...",previously wrote review right reading book tod...
1,1,"""Atomic Habits"" by James Clear is a transforma...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear transformative ...
2,2,I've read a lot of books on changing behavior ...,"[i, 've, read, a, lot, of, books, on, changing...",'ve read lot book changing behavior building h...
3,3,Very insightful and easy to read. It’s crazy h...,"[very, insightful, and, easy, to, read, ., it,...",insightful easy read ’ crazy simple thought ma...
4,4,"""Atomic Habits"" by James Clear is pretty much ...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear pretty much gam...


In [16]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4610 entries, 0 to 4609
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Unnamed: 0    4610 non-null   int64 
 1   Reviews       4610 non-null   object
 2   text_tokens   4610 non-null   object
 3   cleaned_text  4610 non-null   object
dtypes: int64(1), object(3)
memory usage: 144.2+ KB


In [18]:
# Distribution of review lengths
df['review_length'] = df['cleaned_text'].apply(lambda x: len(x.split()))

Here, I calculated the length of each review as a word count. A lambda function applied to the **cleaned_text** column splits the text on whitespace with split() and takes the length of the resulting list. Because **cleaned_text** has already had stopwords removed, this counts the remaining tokens rather than the words in the original review. The results are stored in a new column named **review_length**.
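The cell above only computes the lengths. To actually look at their distribution, summary statistics are a quick first step. This is a minimal sketch on made-up strings, using pandas string methods instead of a lambda; the toy frame is an assumption for illustration.

```python
import pandas as pd

# Toy cleaned reviews (made up) standing in for df['cleaned_text'].
toy = pd.DataFrame({
    "cleaned_text": [
        "insightful easy read",
        "atomic habit james clear transformative",
        "great book",
    ]
})

# Vectorized equivalent of apply(lambda x: len(x.split())).
toy["review_length"] = toy["cleaned_text"].str.split().str.len()

# count, mean, std, min, quartiles, and max of the token counts.
summary = toy["review_length"].describe()
print(summary)
```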

In [19]:
df.head()

Unnamed: 0.1,Unnamed: 0,Reviews,text_tokens,cleaned_text,review_length
0,0,I previously wrote this review right after rea...,"[i, previously, wrote, this, review, right, af...",previously wrote review right reading book tod...,1684
1,1,"""Atomic Habits"" by James Clear is a transforma...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear transformative ...,138
2,2,I've read a lot of books on changing behavior ...,"[i, 've, read, a, lot, of, books, on, changing...",'ve read lot book changing behavior building h...,332
3,3,Very insightful and easy to read. It’s crazy h...,"[very, insightful, and, easy, to, read, ., it,...",insightful easy read ’ crazy simple thought ma...,24
4,4,"""Atomic Habits"" by James Clear is pretty much ...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear pretty much gam...,179


In [20]:
df = df.rename(columns = {'Reviews': 'review'})

In [21]:
df.head()

Unnamed: 0.1,Unnamed: 0,review,text_tokens,cleaned_text,review_length
0,0,I previously wrote this review right after rea...,"[i, previously, wrote, this, review, right, af...",previously wrote review right reading book tod...,1684
1,1,"""Atomic Habits"" by James Clear is a transforma...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear transformative ...,138
2,2,I've read a lot of books on changing behavior ...,"[i, 've, read, a, lot, of, books, on, changing...",'ve read lot book changing behavior building h...,332
3,3,Very insightful and easy to read. It’s crazy h...,"[very, insightful, and, easy, to, read, ., it,...",insightful easy read ’ crazy simple thought ma...,24
4,4,"""Atomic Habits"" by James Clear is pretty much ...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear pretty much gam...,179


# How to do Sentiment Analysis

### Using TextBlob
I imported TextBlob to harness its natural language processing functionalities, which include sentiment analysis, part-of-speech tagging, and text parsing, among others, for enhanced text analysis tasks.

In [22]:
from textblob import TextBlob

In [23]:
# Install TextBlob if it is not installed yet (run this before the import above)
%pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [24]:
# Define a function to get sentiment polarity
def get_sentiment(text):
    analysis = TextBlob(text)
    
    # Return the polarity score
    return analysis.sentiment.polarity

I defined a function named **get_sentiment** to analyze the sentiment polarity of text data. Within the function, I instantiated a TextBlob object with the provided text, allowing me to access its sentiment analysis functionality. By retrieving the polarity score using **analysis.sentiment.polarity**, the function returns the sentiment polarity of the input text. 

**Sentiment polarity of text indicates the degree of positivity or negativity expressed in the text, typically ranging from -1 (negative) to +1 (positive), with 0 representing neutral sentiment.**

In [25]:
# Apply sentiment function to the cleaned_text column to get sentiment polarity
df['sentiment'] = df['cleaned_text'].apply(get_sentiment)

I used the **get_sentiment** function to analyze the sentiment polarity of each review in the **cleaned_text** column of the DataFrame 'df' and stored the results in a new column named 'sentiment'.

In [27]:
# Classify sentiment as positive, negative, or neutral based on polarity score
df['sentiment_category'] = df['sentiment'].apply(lambda x: 'positive' if x > 0 else 'negative' if x < 0 else 'neutral')

**I classified the sentiment of each review in the DataFrame 'df' as positive, negative, or neutral based on its polarity score. This was achieved by applying a lambda function to the 'sentiment' column, which assigns 'positive' if the polarity score is greater than 0, 'negative' if it's less than 0, and 'neutral' if it equals 0, storing the results in a new column named 'sentiment_category'.**
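The nested conditional inside the lambda works, but it can be hard to read. As a sketch, the same rule can be written with `np.select`, which lists each condition next to its label. The five scores below are the polarity values shown in the notebook output; the `categorize` helper is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Polarity scores of the first five reviews, as shown in the notebook output.
scores = pd.Series([0.130649, 0.169048, 0.145893, -0.055556, 0.228409], name="sentiment")


def categorize(polarity: pd.Series) -> pd.Series:
    """Label scores above 0 as positive, below 0 as negative, and 0 as neutral."""
    conditions = [polarity > 0, polarity < 0]
    labels = ["positive", "negative"]
    result = np.select(conditions, labels, default="neutral")
    return pd.Series(result, index=polarity.index, name="sentiment_category")


print(categorize(scores).tolist())
```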

In [28]:
df.head()

Unnamed: 0.1,Unnamed: 0,review,text_tokens,cleaned_text,review_length,sentiment,sentiment_category
0,0,I previously wrote this review right after rea...,"[i, previously, wrote, this, review, right, af...",previously wrote review right reading book tod...,1684,0.130649,positive
1,1,"""Atomic Habits"" by James Clear is a transforma...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear transformative ...,138,0.169048,positive
2,2,I've read a lot of books on changing behavior ...,"[i, 've, read, a, lot, of, books, on, changing...",'ve read lot book changing behavior building h...,332,0.145893,positive
3,3,Very insightful and easy to read. It’s crazy h...,"[very, insightful, and, easy, to, read, ., it,...",insightful easy read ’ crazy simple thought ma...,24,-0.055556,negative
4,4,"""Atomic Habits"" by James Clear is pretty much ...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear pretty much gam...,179,0.228409,positive


In [29]:
df['sentiment_category'].value_counts()

positive    3688
negative     922
Name: sentiment_category, dtype: int64

In [30]:
df['sentiment_category'].value_counts(normalize=True)

positive    0.8
negative    0.2
Name: sentiment_category, dtype: float64

## Key takeaways from Sentiment Analysis with TextBlob
- TextBlob's lexicon-based polarity scoring provided a quick and efficient way to label customer reviews without annotating them by hand.
- Approximately 80% of reviews (3,688 of 4,610) were classified as positive towards **Atomic Habits** and 20% (922) as negative; none were classified as neutral.
- Positive reviews praised the book's practical advice and transformative impact.
- Negative feedback may point to areas for refinement in content or delivery, but some negative labels look questionable: review 3 ("Very insightful and easy to read...") received a polarity of -0.056.
- Stakeholders are encouraged to leverage positive feedback for marketing and address concerns for continuous improvement.

# Method 2: VADER (SentimentIntensityAnalyzer)

I did not believe the 20% negative share, so I used a second technique, NLTK's VADER sentiment analyzer, to check whether it holds up.

In [33]:
# Work on a copy of df without the TextBlob columns
df2 = df.drop(columns=['sentiment', 'sentiment_category'])
df2.head()


In [34]:
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

# Initialize VADER
sid = SentimentIntensityAnalyzer()

# Define a function to get sentiment polarity using VADER
def get_vader_sentiment(text):
    # Get the polarity scores
    scores = sid.polarity_scores(text)
    # Return the compound score
    return scores['compound']

# Apply the function to the 'cleaned_text' column to get sentiment polarity
df2['vader_sentiment'] = df2['cleaned_text'].apply(get_vader_sentiment)

# Classify sentiment as positive, negative, or neutral based on compound score
df2['vader_sentiment_category'] = df2['vader_sentiment'].apply(lambda x: 'positive' if x > 0 else 'negative' if x < 0 else 'neutral')


df2.head()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\haide\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Unnamed: 0.1,Unnamed: 0,review,text_tokens,cleaned_text,review_length,vader_sentiment,vader_sentiment_category
0,0,I previously wrote this review right after rea...,"[i, previously, wrote, this, review, right, af...",previously wrote review right reading book tod...,1684,0.9998,positive
1,1,"""Atomic Habits"" by James Clear is a transforma...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear transformative ...,138,0.9932,positive
2,2,I've read a lot of books on changing behavior ...,"[i, 've, read, a, lot, of, books, on, changing...",'ve read lot book changing behavior building h...,332,0.9974,positive
3,3,Very insightful and easy to read. It’s crazy h...,"[very, insightful, and, easy, to, read, ., it,...",insightful easy read ’ crazy simple thought ma...,24,0.5267,positive
4,4,"""Atomic Habits"" by James Clear is pretty much ...","[``, atomic, habits, '', by, james, clear, is,...",`` atomic habit '' james clear pretty much gam...,179,0.9954,positive


I used VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool available in NLTK, to analyze the sentiment polarity of the text data.

- First, I initialized VADER and defined a function named 'get_vader_sentiment' to calculate the sentiment polarity using VADER's compound score.
- Then, I applied this function to the 'cleaned_text' column of DataFrame 'df2' to obtain sentiment polarity scores for each review.
- Subsequently, I classified these sentiment polarity scores into positive, negative, or neutral categories based on their compound score.
- Finally, I stored the resulting sentiment categories in a new column named 'vader_sentiment_category' in the DataFrame 'df2'.
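The classification above treats any compound score other than exactly 0 as positive or negative. A commonly cited convention from the VADER project's README instead uses a small neutral band of ±0.05 around zero. Here is a minimal sketch of that rule; the `vader_category` helper and its `threshold` parameter are names introduced here, and the sample scores are the compound values shown in the output above.

```python
def vader_category(compound: float, threshold: float = 0.05) -> str:
    """Classify a VADER compound score using a symmetric neutral band."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores of the first five reviews, as shown in the notebook output.
sample_scores = [0.9998, 0.9932, 0.9974, 0.5267, 0.9954]
print([vader_category(score) for score in sample_scores])

# A weak score that the strict "> 0" rule would call positive.
print(vader_category(0.02))
```

All five sample scores are far above the band, so their labels would not change under this rule. It is also worth keeping in mind that VADER reads punctuation and capitalization as intensity cues, so scoring the raw review text instead of **cleaned_text** may give somewhat different values.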

In [35]:
df2['vader_sentiment_category'].value_counts()

positive    4610
Name: vader_sentiment_category, dtype: int64

Surprisingly, while TextBlob labeled 80% of the reviews as positive and 20% as negative, VADER classified all 4,610 reviews as positive, a stark contrast between the two methods.

In [36]:
# Put the TextBlob and VADER labels side by side
df4 = pd.DataFrame({
    'sentiment_category': df['sentiment_category'],
    'vader_sentiment_category': df2['vader_sentiment_category']
})
df4.head(30)

Unnamed: 0,sentiment_category,vader_sentiment_category
0,positive,positive
1,positive,positive
2,positive,positive
3,negative,positive
4,positive,positive
5,positive,positive
6,positive,positive
7,positive,positive
8,negative,positive
9,positive,positive
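Reading the two columns row by row works for a handful of reviews. For the full dataset, a cross-tabulation and an agreement rate summarize the same comparison. The sketch below uses the ten label pairs shown above; the variable names are assumptions for illustration.

```python
import pandas as pd

# The ten label pairs shown in the output above.
labels = pd.DataFrame({
    "sentiment_category": [
        "positive", "positive", "positive", "negative", "positive",
        "positive", "positive", "positive", "negative", "positive",
    ],
    "vader_sentiment_category": ["positive"] * 10,
})

# Rows: TextBlob label, columns: VADER label, cells: number of reviews.
table = pd.crosstab(labels["sentiment_category"], labels["vader_sentiment_category"])

# Share of reviews on which both tools agree.
agreement = (labels["sentiment_category"] == labels["vader_sentiment_category"]).mean()

print(table)
print(agreement)
```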


In [46]:
df2['vader_sentiment_category'].value_counts()

positive    4610
Name: vader_sentiment_category, dtype: int64

In [47]:
df['sentiment_category'].value_counts()

positive    3688
negative     922
Name: sentiment_category, dtype: int64

# Results
I found that VADER's results were the more plausible of the two for Amazon reviews of "Atomic Habits." The main Amazon page shows over 120,000 reviews and a rating of 4.8/5 for the book, so it is reasonable to expect predominantly positive sentiment in the reviews analyzed. VADER is designed for informal text such as online reviews, and its classification of all 4,610 reviews as positive is closer to that expectation than TextBlob's 20% negative share. That said, a 4.8/5 average does not rule out some negative reviews, and neither tool was checked against labeled data here, so "all positive" should be read as consistent with the overall rating rather than as a verified result. While TextBlob offers a versatile toolkit, VADER's fit with the characteristics of the review data led me to choose it as the preferable method for this task.
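Neither TextBlob nor VADER was compared against labeled data in this notebook. One way to check which tool fits these reviews better would be to hand-label a small random sample and measure how often each tool agrees with the hand labels. The sketch below shows the idea; all labels in it are hypothetical and are not taken from the dataset.

```python
def accuracy(predicted: list[str], actual: list[str]) -> float:
    """Fraction of positions where the predicted label equals the hand label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical hand labels for five sampled reviews (not taken from the dataset).
hand_labels = ["positive", "positive", "negative", "positive", "positive"]
textblob_labels = ["positive", "negative", "negative", "positive", "positive"]
vader_labels = ["positive", "positive", "positive", "positive", "positive"]

print("TextBlob:", accuracy(textblob_labels, hand_labels))
print("VADER:", accuracy(vader_labels, hand_labels))
```

In this made-up sample both tools score the same, for different reasons: one mislabels a positive review, the other misses the single negative one. A real sample would show which kind of error matters more for these reviews.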

## Key takeaways from Sentiment Analysis with VADER
- VADER classified all 4,610 reviews as positive, in line with the overwhelmingly positive sentiment observed on the main Amazon page.
- The classification of all reviews as positive points to strong endorsement and satisfaction among readers of "Atomic Habits."
- The absence of reviews labeled negative suggests a high level of engagement and resonance with the book's content, although it may also mean that VADER missed a small number of critical reviews.

# Conclusion
In this project, I worked through a complete sentiment analysis of customer reviews of "Atomic Habits," from importing the essential libraries to classifying sentiment with both TextBlob and the VADER Sentiment Analyzer. Through this process, I uncovered the remarkable positivity surrounding the book, reflecting its impact on readers. I am satisfied that I dug deeper and learned that most reviewers think positively about the book. This project gave me the confidence and the skills to do further web scraping and sentiment analysis. Next time, I will probably choose a product where people have mixed reactions, so that the dataset is more diverse and I can go further with Feature Engineering and Machine Learning steps. Overall, the insights from this project should give stakeholders useful guidance for strategic decision-making, emphasizing the importance of leveraging positive feedback and addressing potential areas for enhancement.