<a href="https://colab.research.google.com/github/nickstone1911/data-analysis-practice/blob/main/Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lesson 21: Part 2 - Sentiment Analysis in Python

## Exercise Overview:

This lesson continues our discussion of text analysis by focusing on several approaches to sentiment analysis.

>- Overview: Text analysis involves programmatically processing text to gain insights or provide capabilities where human language is the underlying input.

In this lesson we will cover the following topics:

1. Introduction to sentiment analysis
2. Sentiment analysis with the `nltk` package
3. Sentiment analysis with the `textblob` package



---





# Sentiment Analysis

Sentiment analysis is an approach for estimating how positive or negative a piece of text is. This can be useful when examining customer reviews or survey responses about a product or company.

Sentiment analysis is a natural language processing (NLP) technique used to determine the sentiment or emotional tone expressed in a piece of text, such as a tweet, review, or comment. It's a common task in text analysis that can help you understand whether a text conveys a positive, negative, or neutral sentiment.

In Python, we can perform sentiment analysis using various libraries and tools. Two of the most popular libraries for this task are the Natural Language Toolkit (`nltk`) and `TextBlob`, which provides a simple and beginner-friendly way to get started.

---


# Section 1: Setting Up the Environment

## 1.1: Imports

In the next cell import the following modules:

>- `os`, `numpy`, `pandas`, `matplotlib.pyplot`, `seaborn`
>- Use the common aliases for each of these


In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 1.2: Set Your Working Directory

In the next cell(s), set your working directory to your desired workspace for this lesson.

In [None]:
from google.colab import drive
drive.mount('/content/drive/')
os.chdir('/content/drive/MyDrive/BAIM4205')
os.getcwd()

Mounted at /content/drive/


'/content/drive/MyDrive/BAIM4205'

## 1.3: Load Data

The sample guitar review data is in the following file. Download or move the file to your working directory.

>- [GuitarReviews2out.txt](https://drive.google.com/file/d/1HnlHA-XqqTXle0SHS7V20sqwZNfwKbRl/view?usp=sharing)

### Data Description
>- In this example we are using a sample product review file, `GuitarReviews2out.txt`.
>- Each row of this data file represents a single guitar product review
>- Note: The columns of this file are separated with a bar `|`, so we need to pass the correct `sep` argument to `read_csv()` to account for this separator

#### `GuitarReviews2out.txt` Notes
>- This dataset has 3 fields
  1. `product` represents the product being reviewed
  2. `title` represents a summary of the review
  3. `review` is the main text of the review
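
As a quick illustration of why the `sep` argument matters, the sketch below reads a small in-memory sample laid out like the lesson file. The two sample rows and the name `demo_reviews` are made up for this example; only the column layout follows the description above.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same bar-separated layout as the lesson file
sample = io.StringIO(
    "#|product|title|review\n"
    "1|BenderCaster|Love It!|Great neck, great tone.\n"
    "2|WesHall|Pricey|Good guitar, but too expensive.\n"
)

# sep='|' splits on the bar, so commas inside a review stay in one field
demo_reviews = pd.read_csv(sample, sep="|", index_col="#")

print(demo_reviews.columns.tolist())
print(demo_reviews.loc[2, "review"])
```

Because the bar is the separator, the commas inside the review text do not create extra columns.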

  ---



Load the file into a pandas DataFrame named `guita_reviews`:


In [None]:
guita_reviews = pd.read_csv('GuitarReviews2out.txt', sep='|', index_col='#')
guita_reviews.head()

#,product,title,review
1,BenderCaster,My new favorite guitar,"I’ve been playing the Wes Hall for many years,..."
2,BenderCaster,Good but not Great,"The Bender Caster sounds clean, the 3-position..."
3,BenderCaster,Try before buying,I tried the Caster in my local music store and...
4,BenderCaster,Love It!,Everything about the Bender Caster met or exce...
5,BenderCaster,"Great Guitar, Great Price","I’ve been playing Bender guitars for years, th..."


# Section 2: Sentiment Analysis with `nltk`

In this section we learn how to use the `nltk` package and its `SentimentIntensityAnalyzer`. The `SentimentIntensityAnalyzer` uses the *Valence Aware Dictionary and sEntiment Reasoner* (VADER) model to determine whether a text passage is positive or negative.

We will use the `guita_reviews` DataFrame we loaded at the beginning of this notebook for this tutorial.

---

## 2.1: Imports

For this section we can import the following:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# download required component
nltk.download('vader_lexicon')
```


In [None]:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# download required component
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


True

## 2.2: Create Vader Analyzer

```python
analyzer = SentimentIntensityAnalyzer()
```

In [None]:
analyzer = SentimentIntensityAnalyzer()

## 2.3: Prepare Data

>- First we will explore how `analyzer` works on one review
>>- We will call the `polarity_scores()` method and pass in our text for this.
>>- `polarity_scores()` returns a dictionary with 4 items. `'neg'`, `'neu'`, and `'pos'` are the proportions of the text rated negative, neutral, and positive, respectively, and they sum to roughly 1. `'compound'` is a single normalized score between -1 and 1 that summarizes how positive or negative the passage is, where 1 is the most positive and -1 is the most negative.
>- Then we will apply `analyzer` to all reviews in the `guita_reviews` DataFrame

First, grab the first review:

In [None]:
review1 = guita_reviews.iloc[0]['review']
review1

'I’ve been playing the Wes Hall for many years, but finally decided to buy this Bender Caster.  My Wes has terrific sustain, but the Caster gives me a funky sound I love!  The maple neck is terrific too.  Would highly recommend this guitar!'

Now call `polarity_scores(review1)` on `analyzer`:

In [None]:
analyzer.polarity_scores(review1)

{'neg': 0.028, 'neu': 0.621, 'pos': 0.351, 'compound': 0.9646}

Notes on the first review's sentiment:

>- Most (62.1%) of the first review was classified as `'neu'`, while very little (2.8%) was `'neg'`.
>- The `'pos'` score of 35.1% and the `'compound'` score of 0.9646 indicate that the first review was coded as a very positive review
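
To turn a `'compound'` value into a label, a common convention from the VADER project's documentation treats scores of 0.05 or higher as positive, scores of -0.05 or lower as negative, and anything in between as neutral. The sketch below applies that convention to the scores returned for the first review. The helper name `label_compound` and its `threshold` parameter are illustrative and are not part of `nltk`.

```python
# Scores copied from the polarity_scores() output for the first review
scores = {'neg': 0.028, 'neu': 0.621, 'pos': 0.351, 'compound': 0.9646}


def label_compound(compound, threshold=0.05):
    """Map a VADER compound score to a coarse label (hypothetical helper)."""
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    return 'neutral'


# The three proportions describe the whole text, so they sum to about 1
proportion_total = scores['neg'] + scores['neu'] + scores['pos']

print(label_compound(scores['compound']))
print(round(proportion_total, 3))
```

The cutoff is a convention rather than a rule, so it may be worth adjusting `threshold` after reading a sample of reviews near the boundary.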

## 2.4: Apply `VADER` to All Reviews

>- Step 1: We will create a function, `sentiment_score(text)`, that returns the `'compound'` value for a single review
>- Step 2: Then we will pass our function to `apply()` to create a new column called `sentiment_score`


In [None]:
def sentiment_score(text):
    compound = analyzer.polarity_scores(text)['compound']
    return compound

In [None]:
guita_reviews['sentiment_score'] = guita_reviews['review'].apply(sentiment_score)
guita_reviews.head(10)

#,product,title,review,sentiment_score
1,BenderCaster,My new favorite guitar,"I’ve been playing the Wes Hall for many years,...",0.9646
2,BenderCaster,Good but not Great,"The Bender Caster sounds clean, the 3-position...",0.8074
3,BenderCaster,Try before buying,I tried the Caster in my local music store and...,0.4854
4,BenderCaster,Love It!,Everything about the Bender Caster met or exce...,0.9533
5,BenderCaster,"Great Guitar, Great Price","I’ve been playing Bender guitars for years, th...",0.9136
6,WesHall,Amazing Guitar,I’ve been playing an entry level guitar for ye...,0.9751
7,WesHall,None better,The Wes Hall is THE GUITAR for rock & roll. I...,0.9184
8,WesHall,Great Choice,I’ve been playing guitar occasionally but now ...,0.9738
9,WesHall,Good Guitar but Pricey,I was excited to order the Wes but given the p...,-0.8225
10,WesHall,It’s it,"Unbelievable! It’s the classic, the original....",0.5255


# Section 3: Using `textblob` for Sentiment Analysis

`TextBlob` is a general-purpose NLP library that offers a range of text processing capabilities beyond sentiment analysis. If you need to perform other NLP tasks like part-of-speech tagging or noun phrase extraction, `TextBlob` might be a more versatile choice. Older releases also included translation, which was later removed.

`TextBlob` is known for its simplicity and beginner-friendliness. It requires minimal setup and provides a simple API for text analysis tasks, including sentiment analysis.

While `TextBlob` provides a basic sentiment analysis feature, it's also highly customizable. You can train your own sentiment classifier if you have labeled data, allowing you to tailor the sentiment analysis to your specific domain or requirements.

---



### Basic Steps When Using `TextBlob` for Sentiment Analysis

1. Make sure `TextBlob` is installed: `!pip install textblob`
2. Import `TextBlob`
3. Prepare Data
4. Create a `TextBlob` object
5. Return sentiment with `polarity` and `subjectivity`
  >- `blob.sentiment.polarity` returns a number between -1 and 1, where:
  >>- Values close to 1 indicate positive sentiment.
  >>- Values close to -1 indicate negative sentiment.
  >>- Values close to 0 indicate neutral sentiment.

  >- `blob.sentiment.subjectivity` returns a number between 0 and 1 that indicates how subjective or objective the text is.
  >>- Values close to 1 indicate the text is very subjective or emotional.
  >>- Values close to 0 indicate the text is very objective and factual.


---

Let's practice using `TextBlob` with our `guita_reviews` data.

## 3.1: Install textblob

```python
!pip install textblob
```

In [None]:
!pip install TextBlob



## 3.2: Import textblob

In [None]:
import textblob

## 3.3: Create a TextBlob() object

>- Create a text blob object based on the `review1` defined earlier.


In [None]:
blob = textblob.TextBlob(review1)

Now we can access `sentiment.polarity` and `sentiment.subjectivity` on the `blob` object.

In [None]:
print(round(blob.sentiment.polarity,2))
print(round(blob.sentiment.subjectivity,2))

0.25
0.72


## 3.4: Return polarity and subjectivity

In this section we create functions that return the polarity and subjectivity of a piece of text.

In [None]:
def polarity(text):
    """Return the TextBlob polarity of text, rounded to two decimals."""
    blob = textblob.TextBlob(text)
    return round(blob.sentiment.polarity, 2)


def subjectivity(text):
    """Return the TextBlob subjectivity of text, rounded to two decimals."""
    blob = textblob.TextBlob(text)
    return round(blob.sentiment.subjectivity, 2)

Show the polarity and subjectivity of `review1`.
>- Round to two decimal places
>- Enter data in `0.XX` format

In [None]:
print(polarity(review1), subjectivity(review1))

0.25 0.72


## 3.5: Create `blob_sentiment` and `blob_subjectivity` columns

In the next cell, add two columns, `blob_sentiment` and `blob_subjectivity`, to the `guita_reviews` DataFrame to store the `TextBlob` polarity and subjectivity for each review.

In [None]:
guita_reviews['blob_sentiment'] = guita_reviews['review'].map(polarity)
guita_reviews['blob_subjectivity'] = guita_reviews['review'].map(subjectivity)
guita_reviews.head(10)

#,product,title,review,sentiment_score,blob_sentiment,blob_subjectivity
1,BenderCaster,My new favorite guitar,"I’ve been playing the Wes Hall for many years,...",0.9646,0.25,0.72
2,BenderCaster,Good but not Great,"The Bender Caster sounds clean, the 3-position...",0.8074,0.15,0.52
3,BenderCaster,Try before buying,I tried the Caster in my local music store and...,0.4854,0.45,0.44
4,BenderCaster,Love It!,Everything about the Bender Caster met or exce...,0.9533,0.55,0.67
5,BenderCaster,"Great Guitar, Great Price","I’ve been playing Bender guitars for years, th...",0.9136,0.42,0.44
6,WesHall,Amazing Guitar,I’ve been playing an entry level guitar for ye...,0.9751,0.03,0.85
7,WesHall,None better,The Wes Hall is THE GUITAR for rock & roll. I...,0.9184,0.31,0.52
8,WesHall,Great Choice,I’ve been playing guitar occasionally but now ...,0.9738,0.36,0.65
9,WesHall,Good Guitar but Pricey,I was excited to order the Wes but given the p...,-0.8225,0.0,0.44
10,WesHall,It’s it,"Unbelievable! It’s the classic, the original....",0.5255,0.18,0.6


---
# Wrap-Up and Combining Results Notes

## VADER (Valence Aware Dictionary and sEntiment Reasoner):

Strengths:

>- Specifically designed for sentiment analysis of social media text.
>- Scores both the direction of sentiment (positive, negative, neutral) and its intensity.
>- Handles sentiment-laden emojis, slang, and emoticons common in social media.
>- Provides a compound score summarizing overall sentiment.

Limitations:

>- May not perform as well on formal or domain-specific text.
>- Limited support for languages other than English.
>- Relies on a pre-defined lexicon, which may not cover all sentiments or contexts.

## TextBlob:

Strengths:

>- Simple API for various NLP tasks, including sentiment analysis.
>- Can be extended with custom models and with add-on packages for some other languages.
>- The default sentiment analyzer (`PatternAnalyzer`) is lexicon-based; an optional `NaiveBayesAnalyzer` trained on a movie-review corpus is also included.
>- Provides both polarity and subjectivity scores.

Limitations:

>- May not handle sentiment in social media or informal text as well as VADER.
>- Sentiment analysis model may not be as accurate or fine-tuned for specific domains.

---

## Combining `VADER` and `TextBlob`:

Using `VADER` and `TextBlob` together can provide a more nuanced understanding of sentiment than either one alone, especially in social media or informal contexts. It may be worth exploring a systematic way to combine the results of the two libraries. Below are some ideas to consider.

1. Weighted Average: Assign different weights to the sentiment scores from VADER and TextBlob based on their reliability or performance on your specific data. For example, you could use a 70-30 weighting, giving more weight to the algorithm that performs better on your dataset.

2. Threshold-based Approach: Use one algorithm as a primary sentiment analyzer and then use the other algorithm to either confirm or override the primary sentiment if it crosses a certain threshold. This way, you can ensure that the sentiment is more robust and less affected by outliers.

3. Conflict Resolution: If the sentiments from VADER and TextBlob are conflicting, you can use certain rules to resolve the conflict. For example, you could give more weight to the sentiment with stronger intensity or consider the context of the text to decide which sentiment is more appropriate.

4. Aspect-based Sentiment Analysis: Divide the text into different aspects or topics and analyze the sentiment for each aspect separately using both algorithms. This way, you can identify the overall sentiment for each aspect and then aggregate them to get an overall sentiment score.

5. Machine Learning Ensemble: Train a machine learning model using the sentiment scores from both algorithms as features. The model can learn the best way to combine these scores to predict the overall sentiment more accurately.

6. Contextual Analysis: Consider the context of the text and the sentiment words/phrases identified by each algorithm. Some algorithms may perform better in certain contexts or for certain types of texts.

7. Error Analysis: Analyze the cases where VADER and TextBlob disagree to understand the strengths and weaknesses of each algorithm. This can help you make informed decisions on how to combine their results effectively.
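
Ideas 1 and 3 above can be sketched in a few lines of plain Python. The score pairs are copied from the results table in Section 3.5. The function names, the choice to favor VADER in the 70-30 split, and the rule for sign conflicts are assumptions made for illustration, not a recommended configuration.

```python
# (VADER compound, TextBlob polarity) pairs copied from the Section 3.5 table
scores = {
    1: (0.9646, 0.25),
    6: (0.9751, 0.03),
    9: (-0.8225, 0.0),
}


def weighted_score(vader, blob, vader_weight=0.7):
    """Idea 1: weighted average; both inputs already lie in [-1, 1]."""
    return vader_weight * vader + (1 - vader_weight) * blob


def stronger_score(vader, blob):
    """Idea 3: if the signs conflict, keep the score with the larger magnitude."""
    if vader * blob < 0:
        return vader if abs(vader) >= abs(blob) else blob
    # No sign conflict: fall back to the weighted average
    return weighted_score(vader, blob)


for review_id, (vader, blob) in scores.items():
    print(review_id, round(weighted_score(vader, blob), 3))
```

The weight would normally be chosen by checking both libraries against a hand-labeled sample of your own reviews.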

By combining the results of VADER and TextBlob using these approaches, you can create a more robust sentiment analysis system that leverages the strengths of both algorithms and provides more accurate sentiment analysis for your specific use case.
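
As a starting point for the error analysis idea, the sketch below ranks the ten reviews shown in Section 3.5 by the absolute gap between the two scores. The variable names are illustrative, and because the two libraries are not calibrated to the same scale, the gap is only a rough signal for which reviews to read by hand.

```python
# (review id, VADER compound, TextBlob polarity) copied from the Section 3.5 table
rows = [
    (1, 0.9646, 0.25),
    (2, 0.8074, 0.15),
    (3, 0.4854, 0.45),
    (4, 0.9533, 0.55),
    (5, 0.9136, 0.42),
    (6, 0.9751, 0.03),
    (7, 0.9184, 0.31),
    (8, 0.9738, 0.36),
    (9, -0.8225, 0.0),
    (10, 0.5255, 0.18),
]

# Rank reviews by how far apart the two scores are (largest gap first)
by_gap = sorted(rows, key=lambda row: abs(row[1] - row[2]), reverse=True)

# Keep the two reviews where the libraries disagree the most
top_disagreements = by_gap[:2]
for review_id, vader, blob in top_disagreements:
    print(review_id, vader, blob, round(abs(vader - blob), 4))
```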

