**Step 1: Data Cleaning and Labeling**

Label each feedback entry, manually or programmatically, with one of three sentiments:

Positive

Neutral

Negative

This is supervised learning; thus, accurate labels are critical for model quality.
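
Whichever way the labels are produced, it helps to confirm that every entry carries exactly one of the three allowed values before training. The following is a minimal sketch, assuming labels arrive as free-form strings; the function name `normalize_label` and the sample values are hypothetical.

```python
ALLOWED_LABELS = {"Positive", "Neutral", "Negative"}

def normalize_label(raw):
    """Return the canonical label, or None if the value is not a known sentiment."""
    if not isinstance(raw, str):
        return None
    candidate = raw.strip().capitalize()
    return candidate if candidate in ALLOWED_LABELS else None

# Hypothetical raw labels with inconsistent casing, padding, and invalid values.
raw_labels = ["positive", " Negative ", "NEUTRAL", "good", None]
cleaned = [normalize_label(value) for value in raw_labels]

# Positions that need manual review because no valid label could be recovered.
invalid_positions = [i for i, value in enumerate(cleaned) if value is None]

print(cleaned)
print(invalid_positions)
```

Entries flagged in `invalid_positions` can then be relabeled or dropped rather than silently becoming a fourth class.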

**Step 2: Exploratory Data Analysis (EDA)**

2.1 Basic Analysis

Check class balance (Positive, Negative, Neutral).

Review text length distribution, frequently occurring words, and common phrases.
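
The basic checks above reduce to a few counts. This sketch uses only the standard library on a small hypothetical sample (the texts and labels are invented for illustration); the same counts could equally be produced from a DataFrame.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical labeled sample; in practice these come from the feedback dataset.
samples = [
    ("Great app, love the new design", "Positive"),
    ("The app crashes every time I open it", "Negative"),
    ("It works", "Neutral"),
    ("Love it", "Positive"),
]

# Class balance: count and share of each sentiment label.
label_counts = Counter(label for _, label in samples)
label_share = {label: count / len(samples) for label, count in label_counts.items()}

# Text length distribution, measured in whitespace-separated tokens.
lengths = [len(text.split()) for text, _ in samples]

print(label_counts)
print(label_share)
print("mean length:", mean(lengths), "median length:", median(lengths))
```

A heavily skewed `label_share` is a signal to look at per-class metrics later rather than overall accuracy alone.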

2.2 Visualizations

Word Cloud: Identify frequently used words per sentiment.

Bar Graphs/Pie Charts: Class distribution.
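
Both visualizations are drawn from simple counts. As a rough sketch, the per-sentiment word frequencies that a word cloud would display can be computed as follows; the sample data is hypothetical, and the choice of plotting library is left open.

```python
from collections import Counter, defaultdict
import re

# Hypothetical labeled sample for illustration only.
samples = [
    ("Great app, great support", "Positive"),
    ("Love the great design", "Positive"),
    ("App crashes, support is slow", "Negative"),
]

# One word counter per sentiment label.
words_by_sentiment = defaultdict(Counter)
for text, label in samples:
    tokens = re.findall(r"[a-z]+", text.lower())
    words_by_sentiment[label].update(tokens)

for label, counter in words_by_sentiment.items():
    print(label, counter.most_common(3))
```

Each counter can be passed to whichever charting tool is in use; the largest counts are the words that would appear most prominently.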

**Step 3: Text Preprocessing**

3.1 Cleaning and Normalization

Lowercasing: Convert text to lowercase.

Remove punctuation, symbols, and numeric characters.

Remove stop words (e.g., "the", "is", "and").

Stemming/Lemmatization: Normalize words (e.g., "running" → "run").
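
The first three steps can be sketched with the standard library alone. The stop word set below is a small illustrative subset, not a complete list, and stemming or lemmatization is omitted here because it needs a dedicated library such as NLTK.

```python
import re

# Illustrative stop word subset; a real run would use a full list such as NLTK's.
STOP_WORDS = {"the", "is", "and", "a", "it", "was"}

def clean_text(text):
    """Lowercase, strip non-letters, and drop stop words; returns a token list."""
    lowered = text.lower()
    # Keep only lowercase letters and whitespace (removes punctuation, symbols, digits).
    letters_only = re.sub(r"[^a-z\s]", "", lowered)
    tokens = letters_only.split()
    return [token for token in tokens if token not in STOP_WORDS]

print(clean_text("The delivery was LATE, and it took 3 days!"))
```

Note that removing digits and punctuation also removes cues such as "!" or "10/10", which matters if a later step relies on them.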


**1. Hugging Face Pre-trained Model**

**Overview**

This approach uses the Hugging Face transformers library with a pre-trained sentiment analysis model to classify each feedback text.

**Steps:**

*Install Libraries:*

pip install pandas transformers

*Data Loading:*

Load the feedback data from the provided CSV file.

*Sentiment Prediction:*

Use Hugging Face's sentiment-analysis pipeline to analyze each feedback entry.

*Results Evaluation and Saving:*

Summarize the sentiment distribution.

Save the predictions to a CSV file.

***Advantages:***

Quick setup: no labeled data or training is needed, and pre-trained models are typically accurate on general-purpose English text.

Minimal preprocessing required.

***Limitations:***

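Pre-trained models may lack specificity for highly specialized or industry-specific texts.

The default model used in this notebook predicts only POSITIVE or NEGATIVE, so it cannot assign the Neutral class.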

```python
# Sentiment Analysis Using Hugging Face Pre-trained Model

# Step 1: Install Required Libraries
!pip install pandas transformers

# Step 2: Import Libraries
import pandas as pd
from transformers import pipeline

# Step 3: Load Data
# Replace with your CSV file path
df = pd.read_csv('/content/sample_data/user_feedback_dataset_corrected.csv')

# Step 4: Hugging Face Model (Pre-trained Sentiment Analysis)
# No model name is passed, so the pipeline falls back to its default checkpoint.
# For reproducible results, pass model="<checkpoint name>" explicitly.
sentiment_pipeline = pipeline("sentiment-analysis")

# Apply the model to the feedback texts:
# 1. .apply() calls the lambda once for each entry in the FEEDBACK_TEXT column.
# 2. sentiment_pipeline(x) returns a list with one dictionary per input,
#    such as [{'label': 'POSITIVE', 'score': 0.95}].
# 3. [0]['label'] extracts the label ('POSITIVE' or 'NEGATIVE') from that list.
# 4. The labels are stored in a new column named huggingface_sentiment.
df['huggingface_sentiment'] = df['FEEDBACK_TEXT'].apply(lambda x: sentiment_pipeline(x)[0]['label'])

# Evaluate Results
print("Hugging Face Sentiment Analysis Distribution:")
print(df['huggingface_sentiment'].value_counts())

# Save results
df.to_csv('huggingface_sentiment_analysis_results.csv', index=False)

print("Analysis complete. Results saved to 'huggingface_sentiment_analysis_results.csv'.")
```





No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.

Hugging Face Sentiment Analysis Distribution:
huggingface_sentiment
NEGATIVE    251
POSITIVE    249
Name: count, dtype: int64
Analysis complete. Results saved to 'huggingface_sentiment_analysis_results.csv'.
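
The distribution above contains only two labels because the default model is a binary classifier. One possible workaround, offered as an assumption rather than as part of the original analysis, is to treat low-confidence predictions as Neutral. The helper name `to_three_way` and the 0.75 cutoff are hypothetical and would need tuning against hand-labeled examples.

```python
def to_three_way(prediction, min_confidence=0.75):
    """Map a {'label': ..., 'score': ...} prediction to Positive/Neutral/Negative."""
    # Hypothetical rule: a prediction below the cutoff is treated as Neutral.
    if prediction["score"] < min_confidence:
        return "Neutral"
    return prediction["label"].capitalize()

# Dictionaries in the shape returned by the sentiment-analysis pipeline.
predictions = [
    {"label": "POSITIVE", "score": 0.98},
    {"label": "NEGATIVE", "score": 0.61},
    {"label": "NEGATIVE", "score": 0.93},
]

print([to_three_way(p) for p in predictions])
```

A low score from a binary classifier signals uncertainty, not necessarily neutrality, so this is a rough approximation; a model trained with an explicit neutral class would be the more direct option.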


**2. Custom ML Model**

**Overview**

This approach involves training a custom sentiment analysis model using labeled data. The sentiment labels are generated programmatically using the VADER sentiment analyzer.

**Steps:**

*Install Libraries:*

pip install pandas numpy scikit-learn nltk vaderSentiment

*Data Loading and Preprocessing:*

Load and preprocess text data by cleaning, removing stop words, and lemmatizing.

*Programmatic Labeling:*

Automatically generate sentiment labels (Positive, Neutral, Negative) using VADER.

*Feature Extraction:*

Convert text data into numerical vectors using TF-IDF.

*Model Training:*

Train a Random Forest classifier on the processed data.

*Evaluation and Reporting:*

Generate a classification report showing accuracy, precision, recall, and F1-score.

*Results Saving:*

Save model predictions and data to a CSV file.
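
To make the feature extraction step in the list above concrete, here is a hand-rolled sketch of the textbook TF-IDF weighting on three tiny, hypothetical tokenized documents. It is an illustration only: scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers differ from this variant.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized feedback entries.
documents = [
    ["app", "crash", "login"],
    ["app", "great", "design"],
    ["app", "slow", "login"],
]

def tf_idf(documents):
    """Return one {word: weight} dict per document using tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights

for row in tf_idf(documents):
    print({word: round(weight, 3) for word, weight in row.items()})
```

The word "app" occurs in every document and receives a weight of zero, while "crash", which occurs in only one, receives the highest weight in its document.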

***Advantages:***

Highly customizable to specific datasets.

Potentially more accurate for domain-specific language.

***Limitations:***

Requires extensive preprocessing and manual tuning.

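Performance depends heavily on label quality. Because the labels here are generated by VADER, the classifier learns to reproduce VADER's judgments, and the evaluation measures agreement with VADER rather than with human-assigned sentiment.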

***Use Case Recommendation:***

Use the Hugging Face approach for rapid deployment on general-purpose text.

Use the custom ML model for specialized, domain-specific, or nuanced datasets.



```python
# Sentiment Analysis Using Custom ML Model

# Step 1: Install Required Libraries
# scikit-learn is the PyPI package that provides the sklearn module.
!pip install pandas numpy scikit-learn nltk vaderSentiment

# Step 2: Import Libraries
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.feature_extraction.text import TfidfVectorizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk

nltk.download('stopwords')
nltk.download('wordnet')

# Step 3: Load Data
# Replace with your CSV file path
df = pd.read_csv('/content/sample_data/user_feedback_dataset_corrected.csv')

# Step 4: Text Preprocessing
# Build the stop word set and the lemmatizer once instead of on every call.
STOP_WORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)  # keep only letters and whitespace
    words = text.split()
    clean_words = [lemmatizer.lemmatize(word) for word in words if word not in STOP_WORDS]
    return ' '.join(clean_words)

df['clean_text'] = df['FEEDBACK_TEXT'].apply(preprocess)

# Step 5: Programmatic Labeling (VADER)
# VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon-based analyzer:
# it looks up words in a dictionary of sentiment scores and combines the scores
# using a set of rules.
# polarity_scores() returns four values:
#   pos, neu, neg: proportions of the text rated positive, neutral, and negative
#                  (they sum to roughly 1; they are not probabilities).
#   compound: a single normalized score from -1 (most negative) to +1 (most positive).
# Note: labels are computed here from clean_text. VADER is designed for raw text, and
# the NLTK stop word list includes negations such as "not", so applying
# get_sentiment to FEEDBACK_TEXT instead may give more faithful labels.

analyzer = SentimentIntensityAnalyzer()

def get_sentiment(text):
    scores = analyzer.polarity_scores(text)
    if scores['compound'] >= 0.05:
        return 'Positive'
    elif scores['compound'] <= -0.05:
        return 'Negative'
    else:
        return 'Neutral'

df['sentiment'] = df['clean_text'].apply(get_sentiment)

# Step 6: Custom ML Model
# TfidfVectorizer converts text into numerical vectors using TF-IDF
# (Term Frequency-Inverse Document Frequency).
# Term Frequency (TF): how often a word appears in a document.
# Inverse Document Frequency (IDF): how rare a word is across all documents.
# TF-IDF gives higher weights to words that are frequent in a document but rare
# across the corpus, and lower weights to words that are common everywhere.
vectorizer = TfidfVectorizer(max_features=5000)

# fit_transform performs two actions:
#   fit: learns the vocabulary and the IDF weights from the text data.
#   transform: converts each document into a TF-IDF-weighted numerical vector.
# X is a sparse matrix (most entries are zero) of shape (n_samples, n_features):
#   n_samples = number of text records.
#   n_features = number of vocabulary words kept (at most max_features, here 5000).
# Each row of X is the vector representation of one piece of feedback.
# Note: the vectorizer is fitted on all rows, including those that later form the test set.
X = vectorizer.fit_transform(df['clean_text'])
y = df['sentiment']

# Training data: input-output pairs used to fit the model.
# Test data: held-out data used to evaluate the model.
# Purpose: detect overfitting and check that the model generalizes.
# A model trained and tested on the same data can memorize answers instead of
# learning patterns, so the split gives a more realistic performance estimate.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model on the training data
clf.fit(X_train, y_train)

# Make predictions on the test data
y_pred = clf.predict(X_test)

# Step 7: Evaluate Model
print("Custom ML Model Classification Report:")
# classification_report compares the true labels (y_test) with the predicted labels (y_pred).
# High precision: when the model predicts a class, it is usually correct.
#   Favor precision when false positives are costly (e.g., marking a bad review as good).
# High recall: the model finds most of the actual items of that class.
#   Favor recall when false negatives are costly (e.g., missing a negative review in moderation).
print(classification_report(y_test, y_pred))

# Save results
# Note: df holds the cleaned text and the VADER labels; the classifier's
# test-set predictions (y_pred) are not written to the file.
df.to_csv('custom_ml_sentiment_analysis_results.csv', index=False)

print("Analysis complete. Results saved to 'custom_ml_sentiment_analysis_results.csv'.")
```


Custom ML Model Classification Report:
              precision    recall  f1-score   support

    Negative       1.00      1.00      1.00        37
     Neutral       1.00      1.00      1.00         4
    Positive       1.00      1.00      1.00        59

    accuracy                           1.00       100
   macro avg       1.00      1.00      1.00       100
weighted avg       1.00      1.00      1.00       100

Analysis complete. Results saved to 'custom_ml_sentiment_analysis_results.csv'.
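
Perfect scores on every class deserve a second look. One possible explanation, stated here as an assumption that has not been checked against this dataset, is that identical cleaned texts occur in both the training and test splits, so the classifier is scored on inputs it has already seen. A minimal sketch of that check on hypothetical data (the function name `overlap_share` is illustrative):

```python
def overlap_share(train_texts, test_texts):
    """Return the fraction of test texts that also appear verbatim in the training texts."""
    if not test_texts:
        return 0.0
    seen = set(train_texts)
    repeated = sum(1 for text in test_texts if text in seen)
    return repeated / len(test_texts)

# Hypothetical cleaned texts standing in for the two sides of a split.
train_texts = ["app crash login", "great design", "slow support", "great design"]
test_texts = ["great design", "love feature", "app crash login", "great design"]

print(overlap_share(train_texts, test_texts))
```

To apply it to the notebook's data, split df['clean_text'] with the same train_test_split arguments and pass the two resulting lists. A high share would suggest deduplicating the feedback before splitting.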
