In [1]:
# SENTIMENT ANALYSIS PLAN?

In [1]:
import pandas as pd
import pyarrow.parquet as pq
from tqdm.auto import tqdm

def read_parquet_in_batches_with_progress(file_path, batch_size):
    """
    Read a Parquet file in fixed-size row batches with a progress bar and per-chunk logging.

    Args:
        file_path (str): Path to the Parquet file.
        batch_size (int): Number of rows per batch.

    Returns:
        pd.DataFrame: Combined DataFrame after processing all batches.
    """
    # Open the Parquet file
    parquet_file = pq.ParquetFile(file_path)
    
    # Total number of rows in the file
    total_rows = parquet_file.metadata.num_rows
    
    # Initialize a list to store DataFrame chunks
    all_chunks = []
    
    # Initialize the progress bar
    with tqdm(total=total_rows, desc="Processing Batches", unit="rows") as pbar:
        # Enumerate batches for logging
        for batch_number, batch in enumerate(parquet_file.iter_batches(batch_size=batch_size), start=1):
            # Convert the batch to a Pandas DataFrame
            df_batch = batch.to_pandas()
            
            # Simulate processing (add your custom logic here)
            all_chunks.append(df_batch)
            
            # Update the progress bar
            pbar.update(len(df_batch))
            
            # Log per-chunk information via tqdm.write so it does not garble the progress bar
            tqdm.write(f"Processed Chunk {batch_number}: {len(df_batch)} rows")
    
    # Combine all chunks into a single DataFrame
    combined_df = pd.concat(all_chunks, ignore_index=True)
    
    return combined_df

In [None]:
if __name__ == "__main__":
    file_path = "Data/2.Processed/ModellingData/P5_final_new.parquet"
    batch_size = 100_000  # Rows per batch; tune to available memory
    
    df = read_parquet_in_batches_with_progress(file_path, batch_size)
    
    print(f"\nFinal DataFrame with {len(df)} rows:")
    # A bare df.head() is not displayed from inside an if-block, so print it explicitly
    print(df.head())

1. Lexicon-Based Approach
1.1 Why a Lexicon-Based Method?
No training data required. It uses a dictionary (lexicon) of words mapped to sentiment scores (positive, negative, neutral).
Quick to implement, and it gives you an unsupervised baseline.
Commonly used resources: VADER (tuned for short, social media–style text, but adaptable) or SentiWordNet (a more general WordNet-based lexicon).
VADER (if your text is not deeply domain-specific)
Works decently on short, informal text.
If your text is more scientific (like PubMed titles/abstracts), you may find many neutral or domain words not recognized by VADER. Still, it can serve as a demonstration approach.

In [None]:
###############################################################################
# CODE: LEXICON-BASED (VADER)
###############################################################################
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk

# 1) Download the VADER lexicon if needed
nltk.download('vader_lexicon')

# Build the analyzer once; constructing it inside the function reloads the lexicon on every call
sid = SentimentIntensityAnalyzer()

def lexicon_based_vader(text):
    """
    Return sentiment scores for the given text using VADER.
    """
    # The result is a dict with the keys neg, neu, pos, and compound
    # For a single label, you might pick:
    # 'positive' if compound >= 0.05, 'negative' if compound <= -0.05, else 'neutral'
    return sid.polarity_scores(text)

# Suppose df has a column "title" with your text
# We'll create columns for VADER sentiment
# Missing titles are replaced with "" so VADER always receives a string
df["vader_scores"] = df["title"].fillna("").apply(lexicon_based_vader)

# Optionally parse them into one label
def get_vader_label(scores):
    compound = scores["compound"]
    if compound >= 0.05:
        return "positive"
    elif compound <= -0.05:
        return "negative"
    else:
        return "neutral"

df["vader_label"] = df["vader_scores"].apply(get_vader_label)

print(df[["title", "vader_scores", "vader_label"]].head())


1.2 Alternative: SentiWordNet
If your text is more formal or domain-based, you might want to explore SentiWordNet or a domain-specific lexicon. The logic is similar, but you’d look up each word’s positivity/negativity in the SentiWordNet dictionary, summing or averaging them.
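The look-up-and-average logic can be sketched without any downloads. The block below is only an illustration: TOY_LEXICON and its scores are invented stand-ins (not real SentiWordNet entries), lexicon_score and score_to_label are hypothetical helper names, and the 0.05 threshold is borrowed from the VADER convention above as an assumption.

```python
import re

# Hypothetical toy lexicon: word -> (positive score, negative score).
# These values are invented for illustration and are NOT taken from SentiWordNet.
TOY_LEXICON = {
    "effective": (0.75, 0.0),
    "improved": (0.625, 0.0),
    "severe": (0.0, 0.625),
    "failure": (0.0, 0.75),
}

def lexicon_score(text, lexicon=TOY_LEXICON):
    """Average (positive - negative) over the words found in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    polarities = [lexicon[t][0] - lexicon[t][1] for t in tokens if t in lexicon]
    if not polarities:
        return 0.0  # No known words: treat the text as neutral
    return sum(polarities) / len(polarities)

def score_to_label(score, threshold=0.05):
    """Map an averaged score to a label (threshold is an assumed cut-off)."""
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"

for title in ["Effective treatment improved outcomes",
              "Severe heart failure",
              "A cohort study"]:
    score = lexicon_score(title)
    print(f"{title!r}: score={score:.4f}, label={score_to_label(score)}")
```

With a real lexicon you would replace the dictionary look-up with a query per word (and decide how to handle multiple word senses), but the averaging step stays the same.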
Justification
Lexicon-based methods are quick for an unsupervised sentiment estimate.
They can fail in domain-specific contexts (e.g., biomedical text might mention “cancer” or “infection,” which are negative in a lay sense but may be neutral from a purely scientific standpoint).
This is why a supervised approach may be more accurate if you have labeled data.

2. Supervised Machine Learning Approach
2.1 Why a Supervised Method?
You’ll train a model on labeled examples of text → sentiment (pos/neg/neu).
This typically yields better results than a lexicon-based method if you have enough labeled data.
Common supervised classifiers: Logistic Regression, SVM, Naive Bayes, or even a fine-tuned BERT.
2.2 Example with Logistic Regression
Steps:

Gather labeled data: You need text + a sentiment label. Perhaps you label 1000+ random samples as positive/negative/neutral.
Feature extraction:
A simple approach uses TF-IDF or CountVectorizer on the text.
More advanced: use pretrained embeddings (e.g., BERT) as features.
Train a scikit-learn classifier (LogReg or SVM).
Predict on unseen text.
Example Code (Basic TF-IDF + Logistic Regression)

In [None]:
###############################################################################
# CODE: SUPERVISED (LOGREG + TF-IDF)
###############################################################################
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Suppose df has columns: "title" (text) and "label" (pos/neg/neu)
# This is your labeled dataset

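# 1) Split train/test, stratifying so both splits keep the same class proportions
#    (stratify needs at least two examples of every class)
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)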

# 2) TF-IDF on the text
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_df["title"])
y_train = train_df["label"]

X_test = vectorizer.transform(test_df["title"])
y_test = test_df["label"]

# 3) Train a Logistic Regression
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# 4) Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))


Justification:

Logistic Regression is a common baseline supervised approach for sentiment analysis.
TF-IDF is quick to implement and typically effective for textual classification if domain-specific embeddings aren’t available.
2.3 Possible Upgrades
SVM or RandomForest: drop-in replacements for Logistic Regression; expect only slight variations in performance.
Neural approach with BERT-based embeddings: worth trying if you have enough data and time.
If your text is domain-specific, consider SciBERT or BioBERT for embeddings or fine-tuning.
3. Combining or Reporting
You’re “obliged to perform sentiment analysis using at least one lexicon-based approach and at least one supervised technique.” So in your final report, you can:

Implement VADER for the lexicon-based method.
Implement Logistic Regression (or SVM) for the supervised method.
Compare performance on the same labeled test set (labels are needed to train the supervised model and to evaluate both methods).
Justify choices:
VADER is quick, well-known, but might not handle domain terms well.
Logistic Regression is a standard baseline for text classification, interpretable, typically robust.
If your text is heavily domain-specific (scientific, biomedical), mention that both methods might have limitations:

Lexicon: might incorrectly classify domain words.
Logistic Regression: requires a domain-labeled dataset.
Still, this approach meets the stated requirement: one lexicon-based, one supervised approach, plus a reasoned justification.
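To compare the two methods on the same test set, you can score both sets of predictions against the same gold labels. The sketch below uses plain Python so it runs without extra libraries; accuracy, macro_f1, and VADER_TO_DATASET are hypothetical names, and the label lists are invented toy data, not results. It also assumes your gold labels use the short pos/neg/neu form, so VADER's positive/negative/neutral output has to be mapped first.

```python
# Assumed mapping from VADER's label names to the dataset's label names
VADER_TO_DATASET = {"positive": "pos", "negative": "neg", "neutral": "neu"}

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Invented toy data standing in for test_df["label"], the VADER labels, and clf.predict(X_test)
y_true = ["pos", "neg", "neu", "neu", "pos", "neg"]
vader_raw = ["positive", "neutral", "neutral", "neutral", "neutral", "negative"]
logreg_pred = ["pos", "neg", "neu", "pos", "pos", "neg"]

# Put both methods on the same label vocabulary before scoring
vader_pred = [VADER_TO_DATASET[label] for label in vader_raw]

for name, pred in [("VADER", vader_pred), ("LogReg", logreg_pred)]:
    print(f"{name}: accuracy={accuracy(y_true, pred):.3f}, macro-F1={macro_f1(y_true, pred):.3f}")
```

Macro-F1 is reported next to accuracy because, on mostly neutral data, a method that predicts "neu" for everything can still reach a high accuracy.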

4. Considering Neutral vs. Non-Neutral
If your data is primarily neutral (like many scientific abstracts), the distribution of sentiment might be heavily skewed. You can either:

Keep a 3-class system (pos/neg/neu).
Merge pos/neg into non-neutral vs. neutral.
Provide an analysis of how many are likely to be neutral, if that’s the main interest.
Either approach is valid, but note that heavily neutral data can reduce your classifier’s performance if you have few positive/negative examples.
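Before choosing between the 3-class and the neutral vs. non-neutral setup, it helps to look at the label distribution under both. The sketch below is a minimal illustration in plain Python: to_binary, class_shares, and balanced_weights are hypothetical helpers, the label list is invented, and the short pos/neg/neu label names are an assumption. The weight formula, n_samples / (n_classes * class_count), is the one scikit-learn documents for class_weight="balanced".

```python
from collections import Counter

def to_binary(label):
    """Collapse pos/neg into a single non-neutral class."""
    return "neutral" if label == "neu" else "non-neutral"

def class_shares(labels):
    """Share of each class in the label list."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

def balanced_weights(labels):
    """Per-class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    return {label: len(labels) / (len(counts) * count) for label, count in counts.items()}

# Invented, heavily neutral toy labels standing in for df["label"]
labels = ["neu"] * 8 + ["pos"] + ["neg"]
binary_labels = [to_binary(label) for label in labels]

print("3-class shares:", class_shares(labels))
print("binary shares: ", class_shares(binary_labels))
print("binary weights:", balanced_weights(binary_labels))
```

If the non-neutral share is very small, passing class_weight="balanced" to LogisticRegression is one way to keep the classifier from ignoring the minority class.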

5. Potential Models Summarized
Lexicon-based:

VADER if text is general or social media–like.
SentiWordNet or other dictionary for more formal text.
Bio domain: Possibly no major out-of-the-box lexicon for sentiment, so VADER is a fallback.
Supervised:

Logistic Regression or SVM with TF-IDF → easy to implement, relatively fast.
If large labeled data + domain complexity → fine-tuned BERT (e.g., SciBERT or BioBERT), though the setup is more involved.

Conclusion
Yes, you can approach your data with two methods:
Lexicon-based (like VADER) for an unsupervised baseline,
Supervised (like Logistic Regression) for better domain accuracy if you have labeled data.
Keep in mind domain-specific challenges if your text is specialized.
If your data is mostly neutral, analyzing pos/neg signals might require a large labeled set or more sophisticated domain lexicons.
This satisfies the requirement to use “at least one lexicon-based approach and at least one supervised machine learning technique”, plus justification for each choice.
With these code blocks and the rationale above, you can implement both methods, compare them, and then summarize in your final report.