# Twitter Entity Sentiment Analysis with Random Forest

## 1. Introduction

The goal is to determine the sentiment (Positive, Negative, or Neutral) expressed towards a specific *entity* within a given *tweet*. The dataset treats 'Irrelevant' tweets (those not related to the entity) as 'Neutral'. We will use a pre-trained transformer model, specifically BERTweet, fine-tuned on the provided training data. The evaluation metric for the competition is Top 1 Classification Accuracy.

**Workflow:**

1.  **Setup:** Import libraries and define constants (model name, paths, hyperparameters).
2.  **Data Loading & Preprocessing:** Load training and validation data, handle missing values, map sentiment labels ('Irrelevant' to 'Neutral', then string labels to integers).
3.  **Model & Tokenizer Initialization:** Load the BERTweet model and tokenizer. Includes logic to download from Hugging Face Hub if internet is available, or load from a local save directory otherwise (useful for Kaggle environments).
4.  **Dataset Preparation:** Create a custom PyTorch `Dataset` class to format the input (Tweet + Entity) for the BERTweet model. Create `DataLoader` instances for batching and shuffling.
5.  **Training Setup:** Define the optimizer (AdamW) and a learning rate scheduler with warmup.
6.  **Training & Evaluation Functions:** Define functions for a single training epoch (`train_epoch`) and for evaluating the model (`eval_model`) on the validation set.
7.  **Training Loop:** Run the main training loop for a fixed number of epochs, evaluating the model after each epoch and saving the best performing checkpoint based on validation accuracy.
8.  **Conclusion:** Print the best validation accuracy achieved.

## 2. Setup and Configuration

First, we import the necessary libraries and define key configuration parameters for our experiment.

*   `pandas`, `numpy`: For data manipulation.
*   `os`, `shutil`, `socket`: For interacting with the file system and checking network connectivity.
*   `torch`: The core PyTorch library.
*   `transformers`: Hugging Face library for accessing pre-trained models (BERTweet), tokenizers, and utility functions like optimizers and schedulers.
*   `sklearn.metrics`: For calculating accuracy.
*   `tqdm`: For displaying progress bars during training and evaluation.

In [11]:
import pandas as pd
import numpy as np
import os
import torch
import shutil
import socket
from torch.nn import DataParallel
from torch.utils.data import Dataset, DataLoader
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    get_linear_schedule_with_warmup
)
from torch.optim import AdamW
from sklearn.metrics import accuracy_score
from tqdm.notebook import tqdm

### Configuration Parameters

We define constants for model name, data paths, save directory, and hyperparameters.

*   `MODEL_NAME`: We use `bertweet-base`, a BERT model pre-trained specifically on English Tweets, making it suitable for this task.
*   `TRAIN_PATH`, `VALID_PATH`: Paths to the competition's training and validation CSV files.
*   `SAVE_DIR`: Directory where the fine-tuned model and tokenizer will be saved.
*   `MAX_LEN`: Maximum sequence length for tokenization. Tweets longer than this will be truncated.
*   `BATCH_SIZE`: Number of samples processed in each training step. Adjust based on GPU memory.
*   `EPOCHS`: Number of times to iterate over the entire training dataset.
*   `LEARNING_RATE`: The learning rate for the AdamW optimizer.
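The configuration cell itself is not shown here. A minimal sketch of what it might look like follows. `vinai/bertweet-base` is the Hugging Face Hub ID for BERTweet; the file paths and every hyperparameter value are placeholder assumptions, not the values used in the original experiment.

```python
# Hypothetical configuration cell: paths and hyperparameter values below are
# illustrative assumptions, not values taken from the original notebook.
MODEL_NAME = "vinai/bertweet-base"     # Hugging Face Hub ID for BERTweet
TRAIN_PATH = "twitter_training.csv"    # assumed file name
VALID_PATH = "twitter_validation.csv"  # assumed file name
SAVE_DIR = "./bertweet_finetuned"      # assumed output directory
MAX_LEN = 128         # maximum tokens per example; longer inputs are truncated
BATCH_SIZE = 32       # lower this if the GPU runs out of memory
EPOCHS = 3            # full passes over the training data
LEARNING_RATE = 2e-5  # a common starting point for fine-tuning BERT-style models

print(f"Fine-tuning {MODEL_NAME} for {EPOCHS} epochs at lr={LEARNING_RATE}")
```

Keeping these values in one place makes it easy to rerun the experiment with a different batch size or learning rate without touching the training code.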

## 3. Data Loading and Preprocessing

We load the training and validation datasets using pandas. The competition dataset doesn't have headers, so we provide column names.

**Preprocessing Steps:**

1.  **Load Data:** Read CSV files into pandas DataFrames.
2.  **Map 'Irrelevant' to 'Neutral':** As per the competition description, we merge the 'Irrelevant' class into the 'Neutral' class.
3.  **Handle Missing Values:** Drop rows where the 'Tweet' or 'Sentiment' column is missing.
4.  **Create Numerical Labels:** Map the string sentiment labels ('Positive', 'Negative', 'Neutral') to integers (0, 1, 2) for model training.
5.  **Verify Labels:** Check for any unexpected sentiment values after mapping and handle potential NaNs in the 'label' column if necessary.

In [12]:
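# Load the training CSV from Google Drive
import pandas as pd
# Constructing the direct download link
file_id = '1-FqkfXCQ2_O7zSkODq3TVTajNyzUJiG0'
direct_download_url = f'https://drive.google.com/uc?export=download&id={file_id}'

# Read the CSV directly from the URL. Without header=None, pandas treats the
# first data row as the header (visible in the output below).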
try:
    df = pd.read_csv(direct_download_url)
    print("File loaded successfully!")
    print(df.head())
    train_df = df.copy()
except Exception as e:
    print(f"Error loading file: {e}")
    print("Please ensure the file is shareable with 'Anyone with the link' and the ID is correct.")




File loaded successfully!
   2401  Borderlands  Positive  \
0  2401  Borderlands  Positive   
1  2401  Borderlands  Positive   
2  2401  Borderlands  Positive   
3  2401  Borderlands  Positive   
4  2401  Borderlands  Positive   

  im getting on borderlands and i will murder you all ,  
0  I am coming to the borders and I will kill you...     
1  im getting on borderlands and i will kill you ...     
2  im coming on borderlands and i will murder you...     
3  im getting on borderlands 2 and i will murder ...     
4  im getting into borderlands and i can murder y...     


In [13]:


train_df
train_df.describe()
train_df.columns

Index(['2401', 'Borderlands', 'Positive',
       'im getting on borderlands and i will murder you all ,'],
      dtype='object')

In [14]:
# Inspect the columns: pandas took the first data row as the header

import pandas as pd

# The file has no real header row, so the "header" shown here is the first record
print("DataFrame head with existing header:")
print(train_df.head())

# You can also inspect the column names directly
print("\nDataFrame columns:")
print(train_df.columns.tolist())

DataFrame head with existing header:
   2401  Borderlands  Positive  \
0  2401  Borderlands  Positive   
1  2401  Borderlands  Positive   
2  2401  Borderlands  Positive   
3  2401  Borderlands  Positive   
4  2401  Borderlands  Positive   

  im getting on borderlands and i will murder you all ,  
0  I am coming to the borders and I will kill you...     
1  im getting on borderlands and i will kill you ...     
2  im coming on borderlands and i will murder you...     
3  im getting on borderlands 2 and i will murder ...     
4  im getting into borderlands and i can murder y...     

DataFrame columns:
['2401', 'Borderlands', 'Positive', 'im getting on borderlands and i will murder you all ,']


In [15]:


import pandas as pd

# Google Drive file ID of the training CSV and its direct download link
file_id = '1-FqkfXCQ2_O7zSkODq3TVTajNyzUJiG0'
direct_download_url = f'https://drive.google.com/uc?export=download&id={file_id}'

# NOTE: both "paths" refer to the same Drive file, so the validation set
# loaded below is a copy of the training set, not held-out data
TRAIN_PATH = file_id
VALID_PATH = file_id

# Define column names based on the dataset structure
COLUMN_NAMES = ['Tweet ID', 'Entity', 'Sentiment', 'TweetContent']

try:

    train_df = pd.read_csv(direct_download_url,header=None, names=COLUMN_NAMES)
    print("Training data loaded successfully!")
    print(train_df.head())
    print("\nTraining data info:")
    train_df.info()
except Exception as e:  # a failed URL read raises HTTP/URL errors, not FileNotFoundError
    print(f"Error: could not load training data ({TRAIN_PATH}): {e}")
    train_df = pd.DataFrame(columns=COLUMN_NAMES) # Create empty DataFrame to avoid errors later

# Load validation data
try:
    #valid_df = pd.read_csv(VALID_PATH, header=None, names=COLUMN_NAMES)
    valid_df = pd.read_csv(direct_download_url,header=None, names=COLUMN_NAMES)
    print("\nValidation data loaded successfully!")
    print(valid_df.head())
    print("\nValidation data info:")
    valid_df.info()
except Exception as e:  # a failed URL read raises HTTP/URL errors, not FileNotFoundError
    print(f"Error: could not load validation data ({VALID_PATH}): {e}")
    valid_df = pd.DataFrame(columns=COLUMN_NAMES) # Create empty DataFrame to avoid errors later


# Map 'Irrelevant' to 'Neutral' and handle missing values
if not train_df.empty:
    train_df['Sentiment'] = train_df['Sentiment'].replace('Irrelevant', 'Neutral')
    train_df.dropna(subset=['TweetContent', 'Sentiment'], inplace=True)
    print("\nTraining data after preprocessing:")
    print(train_df['Sentiment'].value_counts())
    print(f"Training data shape after dropna: {train_df.shape}")

if not valid_df.empty:
    valid_df['Sentiment'] = valid_df['Sentiment'].replace('Irrelevant', 'Neutral')
    valid_df.dropna(subset=['TweetContent', 'Sentiment'], inplace=True)
    print("\nValidation data after preprocessing:")
    print(valid_df['Sentiment'].value_counts())
    print(f"Validation data shape after dropna: {valid_df.shape}")


# Create numerical labels
SENTIMENT_MAP = {'Positive': 0, 'Negative': 1, 'Neutral': 2}
if not train_df.empty:
    train_df['label'] = train_df['Sentiment'].map(SENTIMENT_MAP)
    # Verify labels - check for any NaNs that might result from unexpected values
    if train_df['label'].isnull().any():
        print("Warning: Found rows with unexpected sentiment values in training data after mapping.")
        print(train_df[train_df['label'].isnull()]['Sentiment'].unique())
        # Optionally drop these rows or investigate
        train_df.dropna(subset=['label'], inplace=True)
        print(f"Training data shape after dropping rows with invalid labels: {train_df.shape}")


if not valid_df.empty:
    valid_df['label'] = valid_df['Sentiment'].map(SENTIMENT_MAP)
    # Verify labels - check for any NaNs that might result from unexpected values
    if valid_df['label'].isnull().any():
        print("Warning: Found rows with unexpected sentiment values in validation data after mapping.")
        print(valid_df[valid_df['label'].isnull()]['Sentiment'].unique())
        # Optionally drop these rows or investigate
        valid_df.dropna(subset=['label'], inplace=True)
        print(f"Validation data shape after dropping rows with invalid labels: {valid_df.shape}")


print("\nData Loading and Preprocessing Complete.")
if not train_df.empty:
  print("Sample training data with labels:")
  print(train_df[['TweetContent', 'Sentiment', 'label']].head())
if not valid_df.empty:
  print("\nSample validation data with labels:")
  print(valid_df[['TweetContent', 'Sentiment', 'label']].head())



Training data loaded successfully!
   Tweet ID       Entity Sentiment  \
0      2401  Borderlands  Positive   
1      2401  Borderlands  Positive   
2      2401  Borderlands  Positive   
3      2401  Borderlands  Positive   
4      2401  Borderlands  Positive   

                                        TweetContent  
0  im getting on borderlands and i will murder yo...  
1  I am coming to the borders and I will kill you...  
2  im getting on borderlands and i will kill you ...  
3  im coming on borderlands and i will murder you...  
4  im getting on borderlands 2 and i will murder ...  

Training data info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 74682 entries, 0 to 74681
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Tweet ID      74682 non-null  int64 
 1   Entity        74682 non-null  object
 2   Sentiment     74682 non-null  object
 3   TweetContent  73996 non-null  object
dtypes: int64(1), object(


In [None]:
# Perform Random Forest analysis

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Assuming train_df and valid_df are already loaded and preprocessed as in the preceding code

if train_df.empty or valid_df.empty:
    print("Training or validation data is empty. Cannot perform Random Forest analysis.")
else:
    # Fit TF-IDF on the training tweets only, then reuse that vocabulary for the
    # validation tweets, so no validation text leaks into the features
    # ngram_range=(1, 2) adds bigrams alongside unigrams
    vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
    X_train = vectorizer.fit_transform(train_df['TweetContent'])
    X_valid = vectorizer.transform(valid_df['TweetContent'])

    y_train = train_df['label']
    y_valid = valid_df['label']

    # Initialize and train the Random Forest classifier
    # You can tune n_estimators, max_depth, etc.
    rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
    rf_classifier.fit(X_train, y_train)

    # Make predictions on the validation set
    y_pred = rf_classifier.predict(X_valid)

    # Evaluate the model
    accuracy = accuracy_score(y_valid, y_pred)
    print(f"\nRandom Forest Validation Accuracy: {accuracy:.4f}")

    # Optional: Display a classification report
    from sklearn.metrics import classification_report
    print("\nRandom Forest Classification Report:")
    print(classification_report(y_valid, y_pred, target_names=list(SENTIMENT_MAP.keys())))


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Requires the true labels (y_valid) and predictions (y_pred) from the previous cell
if 'y_valid' in locals() and 'y_pred' in locals():
    # Visualize the confusion matrix (rows: true labels, columns: predictions)
    class_names = list(SENTIMENT_MAP.keys())
    cm = confusion_matrix(y_valid, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=class_names, yticklabels=class_names)
    plt.title('Confusion Matrix for Random Forest Model')
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')
    plt.show()
else:
    print("Cannot generate visualizations. Validation DataFrame or predictions are not available.")


# iPhone Entity Sentiment Analysis with Random Forest

## 1. Introduction

The goal of this project is to determine the sentiment (Positive, Negative, or Neutral) expressed toward a specific entity mentioned in an iPhone review dataset.

Instead of a transformer-based deep learning model, we use a Random Forest classifier, a traditional machine learning approach. It trains efficiently on modest computational resources and serves as a reasonable baseline for text classification tasks.


The dataset, `Iphone.csv`, consists of 3,062 rows of customer reviews for various iPhone models sold primarily in India. Each review carries metadata: `productAsin` (a unique product identifier), the country of the review, the date it was posted, and a boolean flag `isVerified` indicating whether the purchase was verified. A central feature is `ratingScore`, a numerical value from 1 to 5 reflecting the customer's satisfaction. Two textual fields, `reviewTitle` and `reviewDescription`, provide the qualitative feedback needed for natural language processing tasks such as sentiment analysis.

Each review entry also contains a `reviewUrl` linking to the full review on Amazon and a `reviewedIn` field that combines date and country in natural-language form. The `variant` and `variantAsin` columns capture product variation details, specifying color and storage options. The `reviewDescription` column, which contains the core text of the user's feedback, is the key input for training and evaluating sentiment models. About 2.8% of its entries are missing and must be handled during preprocessing. Overall, the dataset is well suited to supervised learning, especially classifying review sentiment derived from star ratings or predicting user feedback with natural language techniques.

# iPhone Review Sentiment Analysis Workflow (Random Forest Classifier)

This document outlines the complete workflow to process the `Iphone.csv` dataset, apply sentiment labeling, and train a machine learning model to predict sentiment based on user reviews.

---

##  Step 1: Data Loading & Exploration
- Load the dataset using `pandas`.
- Examine columns, data types, and null values.
- Focus on two fields:
  - `reviewDescription`: The review text (input).
  - `ratingScore`: Used to derive sentiment (target).

---

##  Step 2: Sentiment Labeling
Convert `ratingScore` into sentiment categories:
- `1` or `2` → **Negative**
- `3` → **Neutral**
- `4` or `5` → **Positive**

Drop entries where `reviewDescription` is missing.
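The labeling rule above can be sketched as a small function. The function name `rating_to_sentiment` is a hypothetical helper introduced here for illustration.

```python
def rating_to_sentiment(score):
    """Map a 1-5 star rating to a sentiment label (hypothetical helper)."""
    if not 1 <= score <= 5:
        raise ValueError(f"rating out of range: {score}")
    if score <= 2:
        return "Negative"
    if score == 3:
        return "Neutral"
    return "Positive"

# Apply the rule to every possible rating
labels = [rating_to_sentiment(r) for r in [1, 2, 3, 4, 5]]
print(labels)
```

With a pandas DataFrame, the same function could be applied as `df['ratingScore'].map(rating_to_sentiment)`. Note that the loading cell in this notebook builds a binary `isGood` target (`ratingScore >= 4`) rather than these three classes.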

---

##  Step 3: Text Preprocessing
- Convert text to lowercase.
- Remove punctuation, special characters, and numbers.
- Remove stopwords (e.g., "the", "and", "is").
- (Optional) Apply stemming or lemmatization to normalize words.
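A minimal sketch of the first three cleaning steps, using only the standard library. The `clean_text` name and the tiny stopword set are assumptions for illustration; a real run would more likely use a fuller stopword list such as the ones shipped with NLTK or scikit-learn.

```python
import re

# Tiny illustrative stopword set (an assumption, not a complete list)
STOPWORDS = {"the", "and", "is", "a", "an", "to", "of", "in", "it"}

def clean_text(text):
    """Lowercase, strip non-letters, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation, digits, symbols
    tokens = text.split()
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = clean_text("The camera is GREAT, and the battery lasts 2 days!")
print(cleaned)  # camera great battery lasts days
```

Removing digits also discards information such as "iPhone 13" versus "iPhone 14", so this step may be worth relaxing if model names matter to the analysis.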

---

##  Step 4: Feature Engineering
- Use `CountVectorizer` to convert text to numerical vectors (Bag-of-Words).
- Consider n-grams (bigrams or trigrams) for richer context.
- Optionally use `TfidfVectorizer` for better word importance scaling.

---

##  Step 5: Model Training
- Split the data into training and test sets.
- Train a **Random Forest Classifier** using the feature matrix.
- Use cross-validation for robustness.
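One way to combine these steps is to wrap the vectorizer and the classifier in a scikit-learn pipeline and cross-validate the whole thing, so the vocabulary is refit on each training fold. The sketch below uses a made-up toy dataset and arbitrary hyperparameters; it illustrates the mechanics, not the notebook's actual results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Toy reviews and labels (1 = positive, 0 = negative), invented for illustration
texts = [
    "great phone love it", "excellent camera quality", "amazing battery life",
    "very happy with purchase", "superb display and speed", "best phone ever",
    "terrible battery drains fast", "worst purchase ever", "camera is bad",
    "phone stopped working", "very disappointed with quality", "poor display",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The pipeline refits TF-IDF inside every fold, avoiding vocabulary leakage
pipeline = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=42),
)

# 3-fold stratified cross-validation on accuracy
scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="accuracy")
print("Fold accuracies:", scores)
print("Mean accuracy:", scores.mean())
```

Reporting the mean and spread across folds gives a steadier estimate than a single train/test split, which matters on a dataset of only a few thousand reviews.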

---

##  Step 6: Model Evaluation
- Predict sentiments on the test data.
- Evaluate using:
  - Accuracy
  - Precision, Recall, F1-score
  - Confusion Matrix
- Visualize performance using seaborn/matplotlib.
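The metrics listed above can be computed with `sklearn.metrics`. The labels and predictions below are made up to keep the example small and checkable by hand.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Invented true labels and predictions for six reviews
y_true = ["Positive", "Negative", "Neutral", "Positive", "Negative", "Positive"]
y_pred = ["Positive", "Negative", "Positive", "Positive", "Neutral", "Positive"]
class_order = ["Negative", "Neutral", "Positive"]

acc = accuracy_score(y_true, y_pred)  # 4 of 6 correct
cm = confusion_matrix(y_true, y_pred, labels=class_order)  # rows: true, cols: predicted

print(f"Accuracy: {acc:.4f}")
print(cm)
# zero_division=0 avoids warnings for classes with no correct predictions
print(classification_report(y_true, y_pred, labels=class_order, zero_division=0))
```

In this toy case the accuracy looks acceptable while the Neutral class is never predicted correctly, which is exactly the kind of problem the per-class precision and recall are meant to expose.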

---

##  Step 7: Interpretation
- Display top contributing words (features) from the Random Forest model.
- Analyze:
  - Class imbalance (e.g., more positive reviews than negative)
  - Error patterns (e.g., confusion between neutral and positive)

---

##  Step 8: Deployment
- Wrap the model into a FastAPI or MCP tool for API access.
- Use with Claude UI for query-based summarization or insights.

---

This workflow combines classical machine learning with light text processing and allows fast evaluation of user sentiment toward iPhone products.


In [38]:
#  Load Data
import pandas as pd
from google.colab import drive
drive.mount('/content/drive')
df = pd.read_csv('/content/drive/My Drive/Colab Notebooks/aai-510/project/Iphone.csv')
df = df.dropna(subset=['reviewDescription', 'ratingScore']).fillna('Unknown')
# Binary target: 1 for ratings of 4 or 5, 0 otherwise (3-star reviews fall in the 0 class)
df['isGood'] = (df['ratingScore'] >= 4).astype(int)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [39]:

!pip install nltk transformers torch
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
from sklearn.feature_extraction.text import TfidfVectorizer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')  # tagger resource name used by newer NLTK releases

# Initialize Lemmatizer
lemmatizer = WordNetLemmatizer()

# Function to convert NLTK POS tags to WordNet POS tags for lemmatization
def get_wordnet_pos(word):
    """Maps NLTK POS tags to WordNet tags."""
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {"J": wordnet.ADJ,
                "N": wordnet.NOUN,
                "V": wordnet.VERB,
                "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

# Function to tokenize and lemmatize text
def tokenize_and_lemmatize(text):
    tokens = nltk.word_tokenize(text)
    lemmatized_tokens = [lemmatizer.lemmatize(t, get_wordnet_pos(t)) for t in tokens]
    return " ".join(lemmatized_tokens)

# Apply tokenization and lemmatization
df['processed_review'] = df['reviewDescription'].apply(tokenize_and_lemmatize)

# Create TF-IDF features (sparse term weights, not learned embeddings)
tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # cap the vocabulary at 5000 terms
tfidf_matrix = tfidf_vectorizer.fit_transform(df['processed_review'])

print("Original reviews:")
print(df['reviewDescription'].head())

print("\nProcessed (Tokenized and Lemmatized) reviews:")
print(df['processed_review'].head())

print("\nShape of TF-IDF matrix (Number of reviews, Number of features):")
print(tfidf_matrix.shape)

# Display a small portion of the TF-IDF matrix and feature names
print("\nSample TF-IDF values:")
print(tfidf_matrix[:2].toarray())

print("\nSample TF-IDF feature names (vocabulary):")
print(tfidf_vectorizer.get_feature_names_out()[:20])







[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


Original reviews:
0    Every thing is good about iPhones, there's not...
1    It look so fabulous, I am android user switche...
2    I tried to flip camera while recording but no ...
3                                         100% genuine
4           Happy to get the iPhone 13 in Amazon offer
Name: reviewDescription, dtype: object

Processed (Tokenized and Lemmatized) reviews:
0    Every thing be good about iPhones , there 's n...
1    It look so fabulous , I be android user switch...
2    I try to flip camera while record but no facil...
3                                        100 % genuine
4           Happy to get the iPhone 13 in Amazon offer
Name: processed_review, dtype: object

Shape of TF-IDF matrix (Number of reviews, Number of features):
(2976, 5000)

Sample TF-IDF values:
[[0. 0. 0. ... 0. 0. 0.]
 [0. 0. 0. ... 0. 0. 0.]]

Sample TF-IDF feature names (vocabulary):
['00' '000' '02' '03' '05' '06' '09' '0display' '10' '100' '1000'
 '100percent' '10battery' '10camera' '10display

In [None]:
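import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Prepare data for Random Forest
X = tfidf_matrix  # Features from TF-IDF
y = df['isGood']  # Target variable (0 for bad, 1 for good)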

# Split data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Stratify helps maintain proportion of good/bad reviews

print("\nShape of training features:", X_train.shape)
print("Shape of testing features:", X_test.shape)
print("Shape of training labels:", y_train.shape)
print("Shape of testing labels:", y_test.shape)


# Initialize and train the Random Forest Classifier
rf_model = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1, class_weight='balanced') # Added class_weight='balanced' to handle potential class imbalance

print("\nTraining Random Forest model...")
rf_model.fit(X_train, y_train)
print("Random Forest model training complete.")

# Make predictions on the testing set
y_pred = rf_model.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"\nRandom Forest Test Accuracy: {accuracy:.4f}")

# Print classification report for more detailed evaluation
print("\nRandom Forest Classification Report:")
print(classification_report(y_test, y_pred, target_names=['Bad Review', 'Good Review']))

# Visualize the confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Bad Review', 'Good Review'], yticklabels=['Bad Review', 'Good Review'])
plt.title('Confusion Matrix for Random Forest Model')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
