<a href="https://colab.research.google.com/github/micah-shull/LLMs/blob/main/LLM_014_sentiment_analysis_huggingface.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Why Sentiment Analysis with Hugging Face is Valuable in Business

1. **Customer Insights**:
   - Businesses can analyze product reviews, feedback, and surveys to gauge customer satisfaction and identify areas for improvement.
   - Sentiment analysis helps detect trends in customer opinion, such as common complaints or positive highlights, which can guide product development and customer support strategies.

2. **Brand Monitoring**:
   - Companies use sentiment analysis to track their brand reputation on social media and in news articles. By assessing sentiment around brand mentions, businesses can respond quickly to negative feedback and capitalize on positive trends.

3. **Market Research**:
   - By analyzing sentiment in discussions about competitors or industry trends, companies can gain insights into customer preferences and emerging market demands.

4. **Decision Support**:
   - Sentiment analysis supports decision-making by providing data-driven insights into public opinion, which can guide marketing, product development, and even financial decisions.



In [2]:
# !pip install transformers datasets
# !pip install python-dotenv

### Step 1: Environment Setup and Imports

First, ensure that we have the Hugging Face API key set up, the necessary libraries imported, and the Hugging Face sentiment-analysis pipeline initialized.

The **SMS Spam** and **IMDb** datasets are great choices for sentiment analysis, covering spam detection and movie reviews. Here are a few other popular datasets that provide diverse sentiment analysis contexts:

### 1. **Twitter Sentiment Analysis Datasets**
   - **Sentiment140**: This dataset contains 1.6 million tweets labeled as positive, negative, or neutral.
     - **Use Case**: Ideal for social media sentiment analysis, especially if you’re interested in informal language, slang, and emojis.
     - **Access**: Available on [Kaggle](https://www.kaggle.com/datasets) or directly from Sentiment140’s website.
   - **Airline Twitter Sentiment**: Contains tweets about airlines labeled as positive, negative, and neutral.
     - **Use Case**: Perfect for sentiment analysis in customer feedback, particularly in the context of service industries.
     - **Access**: Available on [Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

### 2. **Amazon Product Reviews**
   - **Amazon Reviews (multiple domains)**: This dataset contains reviews from Amazon across various product categories, labeled with star ratings that can be converted to positive or negative sentiment.
   - **Use Case**: Ideal for product sentiment analysis and understanding consumer opinions on products.
   - **Access**: Available on [Hugging Face](https://huggingface.co/datasets/amazon_polarity) or [Kaggle](https://www.kaggle.com/datasets).

### 3. **Yelp Reviews Dataset**
   - **Yelp Polarity Reviews**: This dataset contains Yelp reviews labeled as positive or negative.
   - **Use Case**: Useful for analyzing customer sentiment related to restaurants, services, and businesses.
   - **Access**: Available on [Hugging Face](https://huggingface.co/datasets/yelp_polarity).

### 4. **Stanford Sentiment Treebank (SST)**
   - **SST-2**: A popular sentiment dataset containing movie reviews from Rotten Tomatoes, labeled as positive or negative (binary).
   - **SST-5**: A more granular version with five sentiment labels (very negative to very positive).
   - **Use Case**: Commonly used to benchmark NLP models for sentiment, ideal for nuanced sentiment classification.
   - **Access**: Available on [Hugging Face](https://huggingface.co/datasets) or through [NLTK’s dataset collection](https://nlp.stanford.edu/sentiment/index.html).

### 5. **Financial Sentiment Analysis (Financial PhraseBank)**
   - **Financial PhraseBank**: This dataset contains financial news statements labeled as positive, negative, neutral, or ambiguous.
   - **Use Case**: Useful for sentiment analysis in finance, helping to understand sentiment in financial news or investor opinions.
   - **Access**: Available on [Kaggle](https://www.kaggle.com/datasets) or by request from the dataset creators.

### For Our Notebook
To keep it simple and focused, starting with **SMS Spam** and **IMDb** should provide enough variety:
- **SMS Spam**: Binary classification between “spam” (negative) and “ham” (neutral).
- **IMDb**: Movie reviews, perfect for testing nuanced sentiment detection.


In [4]:
# Necessary imports
from transformers import pipeline
from sklearn.metrics import classification_report, accuracy_score
from sklearn.datasets import fetch_20newsgroups, fetch_openml
from datasets import load_dataset
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv('/content/huggingface_api_key.env')
api_key = os.getenv("HUGGINGFACE_API_KEY")
# os.environ["HUGGINGFACEHUB_API_TOKEN"] = api_key
os.environ["HF_TOKEN"] = api_key

# Initialize Hugging Face sentiment-analysis pipeline
sentiment_pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")


### Step 2: Define Helper Functions for Dataset Preparation and Sentiment Analysis

To streamline the process, we’ll set up helper functions for:
- **Preparing each dataset**.
- **Running sentiment analysis**.
- **Evaluating model performance**.

#### Helper Function: Load and Preprocess Each Dataset

This function will load each dataset, extract the documents and labels, and convert labels to binary sentiment (if necessary) to work with our sentiment-analysis model.






In [7]:
from datasets import load_dataset

# Load SMS Spam dataset
def load_dataset_for_sentiment(dataset_name):
    if dataset_name == "imdb":
        dataset = load_dataset("imdb")
        documents = dataset["train"]["text"]
        labels = ["pos" if label == 1 else "neg" for label in dataset["train"]["label"]]

    elif dataset_name == "sms_spam":
        dataset = load_dataset("sms_spam")
        documents = dataset["train"]["sms"]
        labels = ["pos" if label == "spam" else "neg" for label in dataset["train"]["label"]]

    else:
        raise ValueError("Dataset not supported.")

    return documents, labels

# Load datasets
imdb_documents, imdb_labels = load_dataset_for_sentiment("imdb")
sms_documents, sms_labels = load_dataset_for_sentiment("sms_spam")


README.md:   0%|          | 0.00/4.98k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/359k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/5574 [00:00<?, ? examples/s]

### View Documents & Labels

In [8]:
# Function to preview a sample of documents and labels
def preview_data(documents, labels, num_samples=5):
    for i in range(num_samples):
        print(f"Sample {i + 1}")
        print("Document:", documents[i])
        print("Label:", labels[i])
        print("-" * 50)

# Preview IMDb dataset
print("IMDb Dataset Preview:")
preview_data(imdb_documents, imdb_labels)

# Preview SMS Spam dataset
print("\nSMS Spam Dataset Preview:")
preview_data(sms_documents, sms_labels)


IMDb Dataset Preview:
Sample 1
Document: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and n

### Step 3: Define Function to Perform Sentiment Analysis and Evaluation

Now, let’s create a function to perform sentiment analysis on each document in the dataset and evaluate the model’s performance.



In [9]:
def analyze_and_evaluate_sentiment(documents, labels):
    predictions = []

    # Perform sentiment analysis
    for doc in documents:
        sentiment = sentiment_pipeline(doc[:512])  # Limit to 512 tokens
        predicted_label = "pos" if sentiment[0]['label'] == 'POSITIVE' else "neg"
        predictions.append(predicted_label)

    # Calculate evaluation metrics
    print("Classification Report:\n", classification_report(labels, predictions))
    print("Accuracy:", accuracy_score(labels, predictions))

    return predictions

### Step 4: Run the Analysis for Each Dataset

We’ll load each dataset, perform sentiment analysis, and display results.


### Explanation of Each Step
1. **Helper Functions**: We created functions to load and preprocess each dataset and perform sentiment analysis.
2. **Sentiment Analysis**: Each document is passed through the Hugging Face pipeline, which limits documents to 512 tokens.
3. **Evaluation**: After analyzing each dataset, we calculate and print the classification report and accuracy.

This setup should get us ready to test each dataset in sequence and see how well the model performs. Let me know if you’d like to make any adjustments or if you’re ready to run the analysis!

In [10]:
# List of datasets to analyze
datasets = ["imdb", "sms_spam"]

for dataset_name in datasets:
    print(f"\nAnalyzing dataset: {dataset_name}")
    documents, labels = load_dataset_for_sentiment(dataset_name)
    predictions = analyze_and_evaluate_sentiment(documents, labels)


Analyzing dataset: imdb
Classification Report:
               precision    recall  f1-score   support

         neg       0.81      0.85      0.83     12500
         pos       0.84      0.80      0.82     12500

    accuracy                           0.83     25000
   macro avg       0.83      0.83      0.83     25000
weighted avg       0.83      0.83      0.83     25000

Accuracy: 0.82676

Analyzing dataset: sms_spam
Classification Report:
               precision    recall  f1-score   support

         neg       1.00      0.64      0.78      5574
         pos       0.00      0.00      0.00         0

    accuracy                           0.64      5574
   macro avg       0.50      0.32      0.39      5574
weighted avg       1.00      0.64      0.78      5574

Accuracy: 0.6420882669537137


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


## Lessons Learned
Certainly! Here’s a summary of the most important lessons when using Large Language Models (LLMs) like Hugging Face models for sentiment analysis:

### 1. **Pre-trained Models and Fine-Tuning**
   - **Pre-trained Models**: Hugging Face provides pre-trained sentiment analysis models (like `distilbert-base-uncased-finetuned-sst-2-english`), which have been trained on large datasets. These models are capable of performing sentiment analysis without needing additional training, saving time and resources.
   - **Fine-Tuning for Specific Domains**: If you’re working with unique text types (e.g., medical or legal documents), fine-tuning the model on domain-specific data can improve accuracy by teaching the model relevant context.

### 2. **Pipeline Setup and Convenience**
   - Hugging Face’s `pipeline` API simplifies the process by handling model loading, tokenization, and predictions in a single line of code, making it ideal for quick, plug-and-play sentiment analysis tasks.
   - The pipeline also abstracts away many details, allowing you to focus on results rather than implementation specifics.

### 3. **Text Limitations and Tokenization**
   - **Token Limit**: Most LLMs (especially BERT-based models) have a 512-token limit. When processing longer texts, you may need to truncate them or split them into smaller chunks to ensure they fit within this limit.
   - **Tokenization**: Understanding tokenization (breaking text into sub-word units) is crucial. LLMs tokenize text in a specific way, which can affect how information is represented. This is generally handled by the pipeline, but understanding it helps in cases where you might need custom preprocessing.

### 4. **Label Mapping and Interpretation**
   - **Model-Specific Labels**: Most Hugging Face models output `"POSITIVE"` and `"NEGATIVE"` for binary classification, but mapping these labels to custom values (like `"pos"` and `"neg"`) may be needed for evaluation consistency.
   - **Confidence Scores**: LLMs provide confidence scores for predictions, allowing for threshold adjustments. For instance, you might only classify text as “positive” if the model’s confidence is above a certain threshold.

### 5. **Performance Evaluation and Limitations**
   - **Accuracy Metrics**: Use metrics like accuracy, precision, recall, and F1-score to evaluate performance. These metrics reveal how well the model handles sentiment analysis across different datasets.
   - **Model Limitations**: LLMs may struggle with nuances like sarcasm, subtle irony, or domain-specific jargon. It’s important to test the model on varied datasets to understand where it performs well and where it may need fine-tuning.

### 6. **Practical Business Applications and Deployment**
   - **Real-World Applications**: Sentiment analysis with LLMs can be used in customer service, brand monitoring, market research, and social media analysis. Recognizing the scope of applications helps identify where sentiment analysis adds the most value.
   - **Efficiency and Cost**: Consider model efficiency, as LLMs can be resource-intensive. DistilBERT and other lighter models are often a good balance of performance and speed for real-time applications.

### Key Takeaways
   - **Start with Pre-trained Models**: They’re highly effective for general sentiment analysis.
   - **Adjust for Specific Needs**: Fine-tuning or using custom thresholds can improve accuracy.
   - **Understand Contextual Strengths**: LLMs excel with nuanced language but may struggle with certain subtleties.
