<a href="https://colab.research.google.com/github/angelaaaateng/AIR_AI_Engineering_Course_2024/blob/main/Day1/Activity1_SpacySentimentAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
# Install necessary libraries
# !pip install spacy spacytextblob
# !python -m spacy download en_core_web_sm
# Documentation: https://spacy.io/universe/project/spacy-textblob

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
# Import necessary libraries
import spacy
from spacytextblob.spacytextblob import SpacyTextBlob
import pandas as pd
import string
from tqdm import tqdm

# Apply tqdm to pandas for progress tracking
tqdm.pandas()


In [3]:
# Load the SpaCy language model and add the textblob component for sentiment analysis
nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("spacytextblob")


<spacytextblob.spacytextblob.SpacyTextBlob at 0x177582032f0>

In [4]:
# Optimized function for preprocessing the text using SpaCy
def preprocess_text(text):
    # Run the text through the spaCy pipeline (tokenization, tagging, lemmatization).
    # Note: spaCy does not lowercase the text; capitalized lemmas such as proper nouns are kept as-is.
    doc = nlp(text)

    # Use a list comprehension to filter and lemmatize tokens
    tokens = [token.lemma_ for token in doc if not token.is_stop and token.is_alpha]

    # Join the tokens back into a single string
    return " ".join(tokens)
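
The sample reviews in this notebook contain literal `<br />` tags, and fragments such as `br` may survive the `is_alpha` filter above. A minimal sketch of stripping them before calling `preprocess_text`, assuming simple break tags are the only markup that matters; the helper name `strip_break_tags` is hypothetical and not part of the original notebook:

```python
import re

# Matches <br>, <br/> and <br /> in any letter case.
# Assumption: break tags are the only HTML markup worth removing here.
_BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def strip_break_tags(text):
    """Replace HTML line-break tags with a space and collapse repeated whitespace."""
    without_tags = _BR_TAG.sub(" ", text)
    return " ".join(without_tags.split())


print(strip_break_tags("A wonderful little production. <br /><br />The filming..."))
```

It could be chained as `preprocess_text(strip_break_tags(text))`; a full HTML parser would be a safer choice if the reviews contain richer markup.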


In [5]:
# Function to classify the sentiment score as "positive", "negative", or "neutral"
def classify_sentiment(score):
    if score > 0:
        return "positive"
    elif score < 0:
        return "negative"
    else:
        return "neutral"

In [6]:
# Load the IMDB dataset from GitHub
url = "https://github.com/angelaaaateng/AIR_AI_Engineering_Course_2024/raw/refs/heads/main/Datasets/IMDB_Dataset.csv"
df = pd.read_csv(url)

In [7]:
# Randomly sample 1000 entries from the dataset,
# because processing every review takes a long time
df_sampled = df.sample(n=1000, random_state=42)

In [8]:
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [9]:
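# DataFrame.size counts cells (rows x columns), not rows.
# Both frames have 2 columns here, so the row counts are half the printed values.
print(df.size)
print(df_sampled.size)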

100000
2000


- A sentiment score greater than 0 is labeled "positive".
- A sentiment score less than 0 is labeled "negative".
- A sentiment score of exactly 0 is labeled "neutral".
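
Because polarity is a float, a score of exactly 0 can be rare, so very few reviews may end up "neutral" under the strict rule. A hedged variant with a small tolerance band is sketched below; the function name `classify_sentiment_with_band` and the default `band=0.05` are assumptions for illustration, not part of the original notebook:

```python
def classify_sentiment_with_band(score, band=0.05):
    """Label a polarity score, treating scores within +/- band of zero as neutral.

    `band` is a hypothetical tolerance; band=0 reproduces the strict rule above.
    """
    if score > band:
        return "positive"
    if score < -band:
        return "negative"
    return "neutral"


for score in (0.4, 0.02, 0.0, -0.3):
    print(score, classify_sentiment_with_band(score))
```

The right band width depends on the data, so it is worth checking a few borderline reviews by hand before settling on a value.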

In [10]:
# Preprocess the reviews
# df['cleaned_review'] = df['review'].apply(preprocess_text)
# Preprocess the reviews with a progress bar
df_sampled['cleaned_review'] = df_sampled['review'].progress_apply(preprocess_text)
df_sampled.head()

100%|██████████| 1000/1000 [00:29<00:00, 33.95it/s]


Unnamed: 0,review,sentiment,cleaned_review
33553,I really liked this Summerslam due to the look...,positive,like Summerslam look arena curtain look overal...
9427,Not many television shows appeal to quite as m...,positive,television show appeal different kind fan like...
199,The film quickly gets to a major chase scene w...,negative,film quickly get major chase scene increase de...
12447,Jane Austen would definitely approve of this o...,positive,Jane Austen definitely approve Paltrow awesome...
39489,Expectations were somewhat high for me when I ...,negative,expectation somewhat high go movie think Steve...


In [11]:
# Perform sentiment analysis on the preprocessed reviews.
# Recent spacytextblob releases expose the TextBlob object as `doc._.blob`;
# the older `doc._.polarity` shortcut fails with
# "AttributeError: [E046] Can't retrieve unregistered extension attribute 'polarity'".
df_sampled['sentiment_score'] = df_sampled['cleaned_review'].progress_apply(lambda review: nlp(review)._.blob.polarity)
df_sampled.head()

In [None]:
# Classify the sentiment based on the score
df_sampled['sentiment_label'] = df_sampled['sentiment_score'].progress_apply(classify_sentiment)
df_sampled.head()

In [None]:
# Display the results
df_sampled[['review', 'cleaned_review', 'sentiment_score', 'sentiment_label']].head()
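
The dataset already carries a `sentiment` column, so the predicted labels can be compared against it. A minimal sketch of that comparison in plain Python; the helper `agreement_rate` and the toy labels are hypothetical, not results from this notebook:

```python
def agreement_rate(true_labels, predicted_labels):
    """Return the fraction of positions where the two label sequences match."""
    pairs = list(zip(true_labels, predicted_labels))
    if not pairs:
        raise ValueError("no labels to compare")
    matches = sum(1 for truth, predicted in pairs if truth == predicted)
    return matches / len(pairs)


# Toy labels for illustration only.
truth = ["positive", "negative", "positive", "negative"]
predicted = ["positive", "positive", "positive", "neutral"]
print(agreement_rate(truth, predicted))  # 2 of the 4 pairs match
```

With the DataFrame, the call would presumably be `agreement_rate(df_sampled['sentiment'], df_sampled['sentiment_label'])`. The rows shown earlier contain only "positive" and "negative" labels, so any "neutral" prediction counts as a mismatch under this measure.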

Documentation: [spacytextblob (spaCy Universe)](https://spacy.io/universe/project/spacy-textblob); [TextBlob documentation](https://textblob.readthedocs.io/en/dev/)

**TextBlob** is a simple, easy-to-use Python library for processing textual data. It provides a consistent API for common natural language processing (NLP) tasks, such as part-of-speech tagging, noun phrase extraction, sentiment analysis, classification, and (in older releases) translation.

### Key Features of TextBlob:

1. **Sentiment Analysis**:
   - One of the most popular features of TextBlob is its ability to perform sentiment analysis. It uses two key measures:
     - **Polarity**: This is a float value within the range [-1.0, 1.0], where:
       - `-1` represents a very negative sentiment.
       - `0` represents a neutral sentiment.
       - `+1` represents a very positive sentiment.
     - **Subjectivity**: This is a float within the range [0.0, 1.0] that represents the degree of subjectivity in the text.
       - `0` means the text is very objective (fact-based).
       - `1` means the text is highly subjective (opinion-based).

2. **Part-of-Speech Tagging**:
   - TextBlob can tag parts of speech, such as nouns, verbs, adjectives, etc. For example, it can classify the words in a sentence to understand the structure and function of each word.

3. **Noun Phrase Extraction**:
   - It can extract noun phrases (a group of words that act as a noun, like "machine learning model") to help understand the focus of a text.

4. **Translation and Language Detection**:
   - Older TextBlob releases could translate between languages and detect the language of a text using the Google Translate API. These features were deprecated and later removed, so they may not be available in a current install.

5. **Classification**:
   - TextBlob supports text classification, which allows you to classify text into different categories based on training data.

### How Does TextBlob Perform Sentiment Analysis?
TextBlob's default sentiment analyzer is based on **Pattern**, a web mining module for Python. It does not use a model that you train; it relies on a lexicon of words annotated with polarity and subjectivity scores and combines the scores of the words it recognizes in the text. Because the lexicon ships with the library, TextBlob can assess sentiment without any further training, which makes it well suited to simple, out-of-the-box sentiment tasks.
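
One way to picture scoring text without a training step is a word-list (lexicon) lookup. The sketch below is a toy illustration of that idea only: the lexicon entries, the negation rule, and the name `toy_polarity` are invented for this example, and it does not reproduce TextBlob's or Pattern's actual implementation or scores.

```python
# Hypothetical mini-lexicon: word -> polarity in [-1.0, 1.0]. Values are invented.
TOY_LEXICON = {"wonderful": 1.0, "good": 0.5, "boring": -0.5, "terrible": -1.0}
NEGATIONS = {"not", "never"}


def toy_polarity(text):
    """Average the polarity of known words, flipping the sign right after a negation."""
    scores = []
    negate = False
    for raw_word in text.lower().split():
        word = raw_word.strip(".,!?;:")
        if word in NEGATIONS:
            negate = True
            continue
        if word in TOY_LEXICON:
            value = TOY_LEXICON[word]
            scores.append(-value if negate else value)
        negate = False  # a negation only affects the word that follows it
    # Text with no known words is treated as neutral.
    return sum(scores) / len(scores) if scores else 0.0


print(toy_polarity("A wonderful film"))
print(toy_polarity("Not good, just boring"))
```

Even this toy version shows why word-level scoring struggles with sarcasm or long-range context: each word is scored almost independently of the rest of the sentence.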

In the context of **spaCy**, `spacytextblob` acts as a bridge that brings TextBlob's capabilities (specifically sentiment analysis) into spaCy's processing pipeline. This lets users combine the strengths of spaCy (fast tokenization and parsing) with TextBlob's sentiment analysis.

### Why Use TextBlob?

1. **Ease of Use**: TextBlob simplifies many of the common NLP tasks with its intuitive API.
2. **Quick Start**: It works well for beginners who want to quickly try out basic NLP tasks without requiring complex configurations or machine learning model setups.
3. **Lightweight**: TextBlob is light compared to larger NLP libraries, making it perfect for smaller tasks or rapid prototyping.

### Limitations:
- **Shallow Analysis**: While great for quick tasks, TextBlob's sentiment analysis and language understanding are relatively shallow compared to transformer-based models such as BERT.
- **Accuracy**: It uses simple, rule-based methods and can be less accurate in some NLP tasks, especially with nuanced text or in situations where deep contextual understanding is needed.

TextBlob is especially suited for quick sentiment analysis, simple text processing tasks, and learning purposes. For more advanced applications, you would likely move towards trained models, for example a spaCy text classifier or a transformer-based model such as BERT or GPT.