A real-time sentiment analysis application for YouTube live chat messages using Python, transformers, and Streamlit.
This project demonstrates a complete machine learning pipeline that:
- Collects live chat messages from YouTube livestreams
- Preprocesses text data (removing URLs, emojis, special characters)
- Performs sentiment analysis using HuggingFace Transformers
- Classifies messages as Positive, Negative, or Neutral
- Displays real-time analytics in an interactive Streamlit dashboard
Use Cases:
- Monitor audience sentiment during live events
- Understand viewer reactions in real-time
- Analyze content engagement and reception
- Academic prototype for NLP/ML projects
```
YouTube Live Chat Source
        ↓
Chat Collection Module (pytchat)
        ↓
Live Message Stream
        ↓
Text Preprocessing
├── Remove URLs
├── Remove Emojis
├── Remove Special Characters
├── Normalize Whitespace
└── Lowercase
        ↓
Sentiment Analysis Model
(HuggingFace Transformers)
        ↓
Sentiment Classification
├── POSITIVE (score > 0.7)
├── NEUTRAL (0.4 ≤ score ≤ 0.7)
└── NEGATIVE (score < 0.4)
        ↓
Statistics Aggregation
├── Count by sentiment
├── Percentages
└── Confidence scores
        ↓
Visualization Dashboard (Streamlit)
├── Sentiment counters
├── Bar chart
├── Pie chart
├── Message feed
└── Statistics
```
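The statistics-aggregation step above can be sketched in a few lines of Python (the function name `aggregate_sentiments` is illustrative, not the project's actual API):

```python
from collections import Counter

def aggregate_sentiments(sentiments):
    """Count labels and compute percentages for the dashboard.

    `sentiments` is a list of labels such as "POSITIVE"/"NEGATIVE"/"NEUTRAL".
    """
    counts = Counter(sentiments)
    total = sum(counts.values())
    if not total:
        return {}
    return {
        label: {"count": n, "pct": round(100 * n / total, 1)}
        for label, n in counts.items()
    }

stats = aggregate_sentiments(["POSITIVE", "POSITIVE", "NEGATIVE", "NEUTRAL"])
# stats["POSITIVE"] == {"count": 2, "pct": 50.0}
```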
```
youtube-chat-sentiment/
├── app.py                # Main Streamlit application
├── chat_collector.py     # YouTube live chat collection
├── preprocessing.py      # Text preprocessing module
├── sentiment_model.py    # Sentiment analysis with transformers
├── visualization.py      # Dashboard visualizations
├── requirements.txt      # Python dependencies
└── README.md             # This file
```
Main Streamlit application that orchestrates the entire pipeline. Features:
- Sidebar controls for video ID input and chat collection
- Three main tabs: Dashboard, Watch, Messages
- Real-time sentiment counters and statistics
- Interactive charts (bar and pie)
- Message feed display
- Data export functionality (CSV)
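The CSV export can be sketched with pandas (the column names follow the export description in this README; the Streamlit wiring via `st.download_button` is omitted, and the sample row is illustrative):

```python
import pandas as pd

# One analyzed message per row; columns match the documented export format.
rows = [{
    "Author": "alice",
    "Original Message": "STREAMER INI LUCU BANGET!!!",
    "Cleaned Text": "streamer ini lucu banget",
    "Sentiment": "POSITIVE",
    "Confidence": 0.85,
    "Timestamp": "2024-01-01 00:00:00",
}]
csv_bytes = pd.DataFrame(rows).to_csv(index=False).encode("utf-8")
# In app.py these bytes would be passed to st.download_button(data=csv_bytes, ...)
```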
Handles YouTube live chat collection using the pytchat library:
- `YouTubeChatCollector` class with connect/disconnect methods
- Stream-based message retrieval
- Error handling and logging
- Message format: `{author, message, timestamp}`
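A minimal collection loop along these lines, based on pytchat's documented `create`/`is_alive`/`sync_items` API (`to_record` and `collect` are illustrative helpers, not necessarily the project's exact code):

```python
def to_record(item) -> dict:
    """Normalize a pytchat chat item into the {author, message, timestamp} format."""
    return {
        "author": item.author.name,
        "message": item.message,
        "timestamp": item.datetime,
    }

def collect(video_id):
    """Yield chat records from an active livestream (requires `pip install pytchat`)."""
    import pytchat  # imported lazily so to_record stays usable without it
    chat = pytchat.create(video_id=video_id)
    while chat.is_alive():
        for item in chat.get().sync_items():
            yield to_record(item)
```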
Text preprocessing pipeline:
- `TextPreprocessor` class with modular preprocessing steps
- URL removal
- Emoji removal
- Special character removal
- Whitespace normalization
- Lowercase conversion
- Batch processing support
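The steps above can be sketched as a single cleaning function (a simplified stand-in for the `TextPreprocessor` class; the emoji character range is a rough approximation):

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002700-\U000027BF\U0001F1E6-\U0001F1FF]+"
)

def clean(text: str) -> str:
    """Apply the README's preprocessing steps to one chat message."""
    text = URL_RE.sub(" ", text)                       # remove URLs
    text = EMOJI_RE.sub(" ", text)                     # remove emojis
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # special chars + lowercase
    return re.sub(r"\s+", " ", text).strip()           # normalize whitespace
```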
Sentiment analysis using pre-trained transformers:
- `SentimentAnalyzer` class using HuggingFace models
- GPU/CPU device management
- Three-class sentiment mapping (POSITIVE/NEGATIVE/NEUTRAL)
- Batch processing capability
- Confidence scores
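The three-class mapping can be sketched like this (illustrative; it mirrors the 0.7 confidence threshold described in this README, with the HuggingFace `pipeline` call shown commented out because it downloads a model on first run):

```python
def map_to_three_class(label: str, score: float) -> str:
    """Map a binary POSITIVE/NEGATIVE model output to three classes:
    low-confidence predictions (score <= 0.7) become NEUTRAL."""
    return "NEUTRAL" if score <= 0.7 else label

# Usage with a transformers pipeline:
# from transformers import pipeline
# clf = pipeline("sentiment-analysis",
#                model="distilbert-base-uncased-finetuned-sst-2-english")
# out = clf("streamer ini lucu banget")[0]   # {"label": ..., "score": ...}
# print(map_to_three_class(out["label"], out["score"]))

print(map_to_three_class("POSITIVE", 0.85))  # → POSITIVE
print(map_to_three_class("NEGATIVE", 0.55))  # → NEUTRAL
```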
Streamlit-based visualization components:
- Sentiment counters with icons
- Bar chart visualization
- Pie chart visualization
- Recent messages table
- Overall statistics display
- CSV export functionality
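A minimal sketch of the bar and pie charts with matplotlib (the project renders these through Streamlit; the figure layout and colors here are illustrative):

```python
import io
import matplotlib
matplotlib.use("Agg")  # headless backend for this sketch; Streamlit manages this itself
import matplotlib.pyplot as plt

counts = {"POSITIVE": 120, "NEGATIVE": 35, "NEUTRAL": 50}

fig, (ax_bar, ax_pie) = plt.subplots(1, 2, figsize=(8, 3))
ax_bar.bar(counts.keys(), counts.values(), color=["green", "red", "gray"])
ax_bar.set_title("Sentiment counts")
ax_pie.pie(counts.values(), labels=counts.keys(), autopct="%1.1f%%")
ax_pie.set_title("Sentiment share")

buf = io.BytesIO()
fig.savefig(buf, format="png")  # in app.py: st.pyplot(fig) instead
```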
- Python 3.10 or higher
- pip (Python package manager)
- Internet connection (for downloading models)
- YouTube livestream URL/Video ID
```bash
cd "/path/to/youtube-chat-sentiment"

# Windows
python -m venv venv
venv\Scripts\activate

# macOS/Linux
python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt
```

This will install:
- pytchat (1.5.7): YouTube live chat scraping
- streamlit (1.28.1): Interactive dashboard
- transformers (4.38.1): Pre-trained NLP models
- torch (2.1.2): PyTorch deep learning framework
- pandas (2.1.4): Data manipulation and analysis
- matplotlib (3.8.2): Data visualization
- regex (2023.12.25): Regular expression library
Note: First download will take a few minutes as the transformer model is cached locally (~500MB).
```bash
# Ensure you're in the virtual environment
# Windows
venv\Scripts\activate
# macOS/Linux
source venv/bin/activate

# Run the Streamlit app
streamlit run app.py
```

The app will open in your browser at http://localhost:8501. If `streamlit` is not on your PATH, run it as a module instead:

```bash
python -m streamlit run app.py
```

1. Get the Video ID
   - Open a YouTube livestream URL
   - Copy the video ID from the URL: in `https://www.youtube.com/watch?v=dQw4w9WgXcQ`, the part after `v=` (`dQw4w9WgXcQ`) is the video ID
2. Open the Streamlit App
   - Run `streamlit run app.py`
   - The browser opens automatically at http://localhost:8501
3. Enter the YouTube Video ID
   - Paste the video ID in the sidebar input field
   - Example: `dQw4w9WgXcQ`
4. Start Collecting Messages
   - Click the "▶️ Start Collecting" button in the sidebar
   - Wait for connection confirmation (✅ or ❌)
5. Monitor in Real-Time
   - Watch sentiment counters update automatically
   - View live statistics on the Dashboard tab
   - Check the Watch tab for status updates
   - Browse individual messages on the Messages tab
6. Export Analysis
   - Click "📥 Download analyzed messages as CSV" to export results
   - Data includes: Author, Original Message, Cleaned Text, Sentiment, Confidence, Timestamp
Example message: "STREAMER INI LUCU BANGET!!! 😂😂 https://youtube.com" (Indonesian for "THIS STREAMER IS SO FUNNY!!!")

1. Preprocessing
   - Input: `"STREAMER INI LUCU BANGET!!! 😂😂 https://youtube.com"`
   - Output: `"streamer ini lucu banget"`
2. Sentiment Analysis
   - Model Output: POSITIVE (confidence: 0.85)
   - Classification: POSITIVE (score > 0.7)
3. Dashboard Update
   - Positive counter: +1
   - Statistics: Updated percentages
   - Charts: Regenerated visualizations
```
Positive: 120 (58.5%)
Negative:  35 (17.1%)
Neutral:   50 (24.4%)
─────────────────────
Total:    205
```
The application now uses a custom fine-tuned DistilBERT model trained on Twitter sentiment data for improved performance.
Model Location: ./fine_tuned_distilbert_sentiment/
Training Details:
- Base Model: distilbert-base-uncased
- Dataset: Kaggle - Twitter Entity Sentiment Analysis (74,681 samples)
- Training Split: 80% training, 10% validation, 10% test (stratified)
- Epochs: 3
- Batch Size: 16
- Learning Rate: 2e-5
- Max Sequence Length: 128
- Optimizer: AdamW with weight decay (0.01)
- Scheduler: Linear warmup (500 steps)
Model Performance:
- Test Accuracy: 90.2%
- Test Precision: 0.902
- Test Recall: 0.902
- Test F1-Score: 0.902
Model Specs:
- Type: DistilBERT (Distilled BERT)
- Task: Sequence classification for sentiment analysis
- Hidden size: 768 dimensions
- Parameters: ~66 million
- Size: ~250 MB
- Speed: ~100-500 messages/second (depending on hardware)
- Device: Automatically uses CUDA (GPU) if available, falls back to CPU
If the fine-tuned model is not available, the application automatically uses:
- Model: `distilbert-base-uncased-finetuned-sst-2-english`
- Accuracy: 91% on the SST-2 dataset
```
Model Output: {label: "POSITIVE"/"NEGATIVE", score: 0.0-1.0}

Mapping to 3-class:
├── POSITIVE (model) + score > 0.7 → POSITIVE
├── POSITIVE (model) + score ≤ 0.7 → NEUTRAL
├── NEGATIVE (model) + score > 0.7 → NEGATIVE
└── NEGATIVE (model) + score ≤ 0.7 → NEUTRAL
```

```
Input: "STREAMER INI LUCU BANGET!!! 😂😂 https://youtube.com"
  ↓ [lowercase]
"streamer ini lucu banget!!! 😂😂 https://youtube.com"
  ↓ [remove_urls]
"streamer ini lucu banget!!! 😂😂"
  ↓ [remove_emojis]
"streamer ini lucu banget!!!"
  ↓ [remove_special_characters]
"streamer ini lucu banget"
  ↓ [remove_extra_whitespace]
"streamer ini lucu banget"
```

- GPU Acceleration: Automatically uses CUDA if available
- Batch Processing: Can process multiple messages simultaneously
- Memory: ~2-3 GB RAM for models + message buffer
- Network: Requires persistent connection to YouTube
- Scalability: Can analyze ~100-500 messages/second
Problem: Cannot connect to the livestream

Solution:
- Verify the YouTube video ID is correct
- Ensure the video is an active livestream (not a regular video)
- Check your internet connection
Problem: `pytchat` is missing or fails to import

Solution:

```bash
pip install pytchat==1.5.7
```

Problem: GPU runs out of memory

Solution:
- Use CPU instead (automatic fallback available)
- Reduce batch size in code
Problem: No messages appear

Solution:
- Verify the livestream is active and has chat enabled
- Check browser console for errors
- Try a different livestream video ID
Problem: First run is slow

Solution:
- First run downloads the model (~500MB) - this is normal
- Subsequent runs are much faster
- Ensure sufficient RAM (4GB+ recommended)
- Track audience sentiment during live events
- Identify negative feedback immediately
- Monitor engagement levels
- Understand which topics trigger positive/negative responses
- Evaluate content quality through sentiment trends
- A/B test different content approaches
- Identify toxic comments early
- Respond to negative sentiment promptly
- Celebrate positive audience reactions
- Collect datasets for NLP research
- Study sentiment patterns in live streaming
- Analyze multilingual sentiment (with appropriate models)
To train a custom fine-tuned model:
1. Prepare Training Notebook
   - Use the `fine_tune_model.ipynb` Jupyter notebook
   - Requires kagglehub authentication for dataset download
2. Install Notebook Dependencies
   - `pip install jupyter kagglehub`
3. Run Fine-Tuning
   - `jupyter notebook fine_tune_model.ipynb`
4. Model Export
   - The fine-tuned model automatically exports to `./fine_tuned_distilbert_sentiment/`
   - Includes: model weights, tokenizer, config, and label mappings
5. Integration
   - `sentiment_model.py` automatically detects and loads the fine-tuned model
   - Falls back to the pre-trained model if the fine-tuned version is unavailable
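The detect-and-fall-back logic can be sketched as follows (an assumed shape, not the exact code in `sentiment_model.py`):

```python
import os

FINE_TUNED_DIR = "./fine_tuned_distilbert_sentiment"
FALLBACK_MODEL = "distilbert-base-uncased-finetuned-sst-2-english"

def resolve_model(local_dir: str = FINE_TUNED_DIR,
                  fallback: str = FALLBACK_MODEL) -> str:
    """Return the local fine-tuned model directory if it exists,
    otherwise the pre-trained HuggingFace model id."""
    return local_dir if os.path.isdir(local_dir) else fallback

# The returned string can be passed straight to
# transformers.pipeline("sentiment-analysis", model=resolve_model()).
```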
- Data Collection: Only collects publicly available chat messages from livestreams
- Data Storage: Messages stored locally in session state (not persisted to disk by default)
- Model Bias: DistilBERT may have biases from training data - use with awareness
- Usage: Comply with YouTube's Terms of Service when using pytchat
- Multi-language support with multilingual models
- Real-time alerts for negative sentiment spikes
- User authentication and session persistence
- Database integration for historical analysis
- Advanced NLP features (topic modeling, emotion detection)
- Custom model fine-tuning (Completed - DistilBERT fine-tuned on Twitter sentiment data)
- HuggingFace Hub model deployment
- API endpoint for integration with other services
- Mobile app support
- pytchat Documentation: https://github.com/taizan-hokuto/pytchat
- HuggingFace Transformers: https://huggingface.co/docs/transformers
- DistilBERT Model Card: https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english
- Streamlit Documentation: https://docs.streamlit.io
- sentencepiece Tokenizer: https://github.com/google/sentencepiece
This project is provided as an educational prototype. Use freely for learning and academic purposes.
Contributions welcome! Feel free to:
- Report bugs and issues
- Suggest new features
- Improve documentation
- Optimize code performance
For issues or questions:
- Check the Troubleshooting section above
- Review code comments in individual modules
- Verify all dependencies are correctly installed
- Check console logs for detailed error messages
- Complete ML Pipeline: Data collection β preprocessing β analysis β visualization
- Production-Ready: Error handling, logging, and robust design
- User-Friendly: Intuitive Streamlit interface with real-time updates
- Scalable Architecture: Modular design allows easy customization
- Educational Value: Well-documented code suitable for learning ML/NLP concepts
Built with ❤️ for sentiment analysis and real-time insights
Last Updated: March 14, 2026
Model Version: v1.1 (Fine-Tuned DistilBERT)
Python Version: 3.10+
Status: ✅ Production Ready with Custom Fine-Tuned Model