This repository provides a minimal yet realistic example of using Hugging Face Transformers with PyTorch to train and deploy a sentiment analysis model.
The project demonstrates how to:
- Fine-tune a pre-trained language model (DistilBERT) on a binary sentiment classification task.
- Use `Trainer` and `TrainingArguments` from Hugging Face to abstract away boilerplate code.
- Leverage PyTorch as the training backend.
- Run inference using a simple script with the Hugging Face `pipeline`.
This is an ideal template for beginners who want to understand how Hugging Face and PyTorch work together.
This project implements a complete end-to-end machine learning pipeline, from data collection to model deployment. The diagram below illustrates the entire workflow and architecture:
Key Pipeline Stages:
- Data Collection: Gather real-world text data from Reddit using the PRAW API
- Data Preprocessing: Clean text and generate pseudo-labels using pre-trained models
- Model Training: Fine-tune DistilBERT using PyTorch and Hugging Face Trainer
- Inference: Deploy the trained model for sentiment prediction
Architecture Layers:
- Application Layer: Your Python scripts for each pipeline stage
- Hugging Face Layer: High-level ML abstractions (AutoTokenizer, Trainer, Pipeline)
- PyTorch Layer: Core deep learning operations (tensors, autograd, optimization)
- Hardware Layer: CPU/GPU processing and memory management
This layered approach demonstrates how modern NLP workflows integrate multiple frameworks to achieve both ease of use and powerful functionality.
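To make the layering concrete, here is a minimal sketch (not part of the repository's scripts) showing a Hugging Face `pipeline` sitting on top of a PyTorch model that runs on whatever hardware is available:

```python
# Minimal illustration of the layers: the pipeline (Hugging Face layer) wraps a
# PyTorch nn.Module (PyTorch layer) whose tensors live on the CPU or GPU (hardware layer).
import torch
from transformers import pipeline

classifier = pipeline("sentiment-analysis")            # Hugging Face layer (default model)

print(isinstance(classifier.model, torch.nn.Module))   # True: PyTorch layer underneath
print(next(classifier.model.parameters()).device)      # hardware layer: cpu or cuda

print(classifier("PyTorch and Hugging Face work well together."))  # application layer
```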
```
├── data/
│   └── processed/
│       └── labeled_dataset.csv   # Cleaned dataset with `text` and `label` columns
├── models/
│   └── distilbert-finetuned/     # Output directory created by Trainer; contains the trained model
├── scripts/
│   ├── collect_data.py           # Data collection script (documentation: scripts/collect_data.md)
│   ├── preprocess_data.py        # Data cleaning and formatting script (documentation: scripts/preprocess_data.md)
│   ├── train_model.py            # Script to train the model (documentation: scripts/train_model.md)
│   └── run_inference.py          # Script to perform inference on sample texts (documentation: scripts/run_inference.md)
├── requirements.txt              # Python dependencies
└── README.md                     # This file
```
```bash
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

Before collecting data, you need to create a Reddit developer application and configure your API credentials.
- Go to Reddit Apps: Visit https://www.reddit.com/prefs/apps
- Login: Sign in with your Reddit account (create one if needed)
- Create New App: Click "Create App" or "Create Another App"
- Fill Application Details:
  - Name: `sentiment-analysis-bot` (or any name you prefer)
  - App type: Select "script" (for personal use scripts)
  - Description: `Data collection for sentiment analysis research`
  - About URL: Leave blank or add your GitHub repo
  - Redirect URI: `http://localhost:8080` (required but not used for scripts)
- Create App: Click "Create app"
- Get Credentials: Note down your `client_id` (under the app name) and `client_secret`
Create a `config.yaml` file in the project root with your Reddit API credentials. You can use the provided template:
```bash
# Copy the example configuration file
cp config.yaml.example config.yaml

# Edit with your credentials
# Use your preferred text editor (notepad, VSCode, nano, etc.)
notepad config.yaml   # Windows
# or
code config.yaml      # VSCode
```

Configuration Template (`config.yaml`):
```yaml
reddit:
  client_id: "your_client_id_here"                          # Found under your app name
  client_secret: "your_client_secret_here"                  # The "secret" field
  user_agent: "sentiment-analysis-bot/1.0 by YourUsername"  # Identify your bot
  subreddit: "technology"                                   # Subreddit to collect from
  limit: 100                                                # Number of comments to collect
  output_path: "data/raw/reddit_comments.json"              # Where to save raw data
```

How to fill in the values:
- client_id: Copy from Reddit app page (appears under the app name)
- client_secret: Copy the "secret" field from Reddit app page
- user_agent: Replace `YourUsername` with your Reddit username
- subreddit: Choose any public subreddit (try "technology", "movies", "gaming")
- limit: Start with 50-100 for testing, increase later for more data
Important Security Notes:
- ✅ The `config.yaml` file is already in `.gitignore` to prevent accidental commits
- ✅ Never share your `client_secret` publicly
- ✅ Use descriptive but respectful `user_agent` strings
- ✅ Replace `YourUsername` with your actual Reddit username
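For orientation, the snippet below is a hypothetical sketch of how `collect_data.py` might read `config.yaml` and build a PRAW client; the actual script may structure this differently:

```python
# Hypothetical sketch: load Reddit credentials from config.yaml and create a PRAW client.
import praw
import yaml

with open("config.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)["reddit"]

reddit = praw.Reddit(
    client_id=cfg["client_id"],
    client_secret=cfg["client_secret"],
    user_agent=cfg["user_agent"],
)

print(reddit.read_only)  # True: read-only access is enough for collecting public comments
```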
```bash
# Test data collection (collect a small sample first)
python scripts/collect_data.py
```

```bash
# Step 1: Collect raw data from Reddit
python scripts/collect_data.py

# Step 2: Clean and label the collected data
python scripts/preprocess_data.py

# Step 3: Train the sentiment analysis model
python scripts/train_model.py
```

This will:
- Collect comments from Reddit using your API credentials
- Clean and automatically label the text data
- Load the dataset from `data/processed/labeled_dataset.csv`
- Tokenize the texts (see the sketch after this list)
- Fine-tune `distilbert-base-uncased`
- Save the model to `models/distilbert-finetuned`
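As a rough illustration of the loading and tokenization steps (not the exact code in `train_model.py`), the dataset and tokenizer could be prepared like this:

```python
# Rough sketch: load the labeled CSV and tokenize it into PyTorch tensors.
import pandas as pd
from transformers import AutoTokenizer

df = pd.read_csv("data/processed/labeled_dataset.csv")        # columns: text, label
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encodings = tokenizer(
    df["text"].tolist(),
    truncation=True,
    padding=True,
    return_tensors="pt",
)
print(encodings["input_ids"].shape)  # (num_examples, max_sequence_length)
```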
```bash
python scripts/run_inference.py --text "I loved the movie!" "That was terrible."
```

Expected output:

```
'I loved the movie!' → POSITIVE (label=1), score=0.8857
'That was terrible.' → NEGATIVE (label=0), score=0.5499
```
`scripts/train_model.py` ([Detailed Documentation](scripts/train_model.md))

- Hugging Face:
  - `AutoTokenizer` to tokenize the text.
  - `AutoModelForSequenceClassification` to load the pre-trained DistilBERT model.
  - `Trainer` and `TrainingArguments` to handle training logic.
- PyTorch:
  - Used behind the scenes by `Trainer` to handle model parameters, backpropagation, and GPU execution.
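The core training setup could look roughly like the sketch below (hyperparameters and dataset handling are illustrative; see the script and its documentation for the real values):

```python
# Hedged sketch of fine-tuning DistilBERT with Trainer and TrainingArguments.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("csv", data_files="data/processed/labeled_dataset.csv")["train"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="models/distilbert-finetuned",   # illustrative hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
trainer.save_model("models/distilbert-finetuned")
```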
`scripts/run_inference.py` ([Detailed Documentation](scripts/run_inference.md))

- Hugging Face:
  - Uses `pipeline` to wrap the trained model for easy inference.
- PyTorch:
  - Used internally when the pipeline calls the model on the GPU.
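A minimal sketch of this pattern (argument parsing and output formatting in the real script may differ):

```python
# Illustrative inference sketch: wrap the fine-tuned checkpoint with pipeline.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="models/distilbert-finetuned",
    tokenizer="models/distilbert-finetuned",
)

for text in ["I loved the movie!", "That was terrible."]:
    result = classifier(text)[0]      # e.g. {"label": "LABEL_1", "score": 0.88}
    print(f"{text!r} -> {result['label']} (score={result['score']:.4f})")
```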
`scripts/collect_data.py` ([Detailed Documentation](scripts/collect_data.md))
- Purpose: Collect text data from external APIs (Reddit, Twitter, etc.)
- Technologies: PRAW for Reddit API, requests for HTTP calls, pandas for data handling
- Output: Raw data saved to `data/raw/` for further processing
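A hypothetical collection helper, assuming the PRAW client and config dictionary from the configuration sketch earlier (the real script may filter and store comments differently):

```python
# Hypothetical helper: pull recent comments from the configured subreddit and save as JSON.
import json

import praw

def collect_comments(reddit: praw.Reddit, cfg: dict) -> list[dict]:
    comments = []
    for comment in reddit.subreddit(cfg["subreddit"]).comments(limit=cfg["limit"]):
        comments.append({"id": comment.id, "body": comment.body, "score": comment.score})
    with open(cfg["output_path"], "w", encoding="utf-8") as f:
        json.dump(comments, f, ensure_ascii=False, indent=2)
    return comments
```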
`scripts/preprocess_data.py` ([Detailed Documentation](scripts/preprocess_data.md))
- Purpose: Clean and label raw text data for training
- Technologies: pandas, nltk/spacy for NLP, scikit-learn for data splitting
- Output: Clean, labeled dataset ready for model training
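One possible shape of the cleaning and pseudo-labeling step, sketched with an off-the-shelf sentiment pipeline (the actual filtering rules and labeling model in `preprocess_data.py` may differ):

```python
# Sketch: clean raw Reddit comments and pseudo-label them with a pre-trained model.
import json
import re

import pandas as pd
from transformers import pipeline

with open("data/raw/reddit_comments.json", encoding="utf-8") as f:
    raw = json.load(f)

# Basic cleaning: strip URLs/whitespace and drop very short comments
texts = []
for item in raw:
    text = re.sub(r"http\S+", "", item["body"]).strip()
    if len(text.split()) >= 5:
        texts.append(text)

# Pseudo-labels from an off-the-shelf sentiment model (1 = positive, 0 = negative)
labeler = pipeline("sentiment-analysis")
labels = [1 if r["label"] == "POSITIVE" else 0 for r in labeler(texts, truncation=True)]

pd.DataFrame({"text": texts, "label": labels}).to_csv(
    "data/processed/labeled_dataset.csv", index=False
)
```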
- `data/processed/labeled_dataset.csv`: CSV file with columns:
  - `text`: the input sentence
  - `label`: the corresponding sentiment label (1 = positive, 0 = negative)
- `models/distilbert-finetuned/`: created after training. Contains:
  - `pytorch_model.bin`: the trained weights
  - `config.json`: model architecture config
  - `tokenizer_config.json`, `vocab.txt`, `tokenizer.json`: tokenizer files
  - `training_args.bin`: the TrainingArguments used
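As a quick sanity check (a sketch, not a repository script), those files are enough to reload the fine-tuned model and tokenizer:

```python
# Sketch: reload the saved checkpoint and run one example through it.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("models/distilbert-finetuned")
tokenizer = AutoTokenizer.from_pretrained("models/distilbert-finetuned")

inputs = tokenizer("Loading the saved checkpoint works.", return_tensors="pt")
print(model(**inputs).logits.shape)  # torch.Size([1, 2]): one row of logits, two classes
```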
For comprehensive guides on each component of this project, see the detailed documentation files:
- [Data Collection Guide](scripts/collect_data.md) - Complete guide for collecting data from Reddit, Twitter, and other sources
- [Data Preprocessing Guide](scripts/preprocess_data.md) - In-depth data cleaning, labeling, and preparation techniques
- [Model Training Guide](scripts/train_model.md) - Deep dive into PyTorch & Hugging Face integration for training
- [Inference & Deployment Guide](scripts/run_inference.md) - Production-ready inference patterns and deployment strategies
Each guide provides detailed explanations, code examples, best practices, and troubleshooting tips specific to PyTorch and Hugging Face workflows.
- Data Augmentation
  - Explore techniques to generate more balanced or diverse training data.
- Custom Datasets with `datasets.Dataset`
  - Use Hugging Face's `datasets` library to load and manipulate large datasets.
- Custom Models and Architectures
  - Replace DistilBERT with another model (e.g., BERT, RoBERTa, or your own architecture).
- Hyperparameter Optimization
  - Try using Optuna or Ray Tune to find better training settings.
- Metrics and Logging
  - Integrate with `wandb` or `tensorboard` for live tracking.
- Model Deployment
  - Package the trained model as a FastAPI or Flask app (a minimal sketch appears after this list).
  - Export as ONNX or use `torchscript` for production.
- Multi-class or Multi-label Classification
  - Expand from binary to more complex tasks.
- Explainability
  - Use libraries like `captum` or `shap` to visualize which words influenced the predictions.
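For the Model Deployment idea, a minimal FastAPI sketch could look like this (assumes `fastapi` and `uvicorn` are installed; endpoint and payload names are illustrative):

```python
# Illustrative FastAPI wrapper around the fine-tuned sentiment model.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline(
    "sentiment-analysis",
    model="models/distilbert-finetuned",
    tokenizer="models/distilbert-finetuned",
)

class SentimentRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(request: SentimentRequest):
    result = classifier(request.text)[0]
    return {"label": result["label"], "score": result["score"]}

# Run with: uvicorn app:app --reload   (assuming this sketch is saved as app.py)
```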
Problem: config.yaml file not found!
- Solution: Create the `config.yaml` file in the project root directory with your Reddit API credentials
Problem: InvalidCredentials or Unauthorized errors
- Solution:
  - Check that `client_id` and `client_secret` are correct
  - Ensure your Reddit app type is set to "script"
  - Verify your `user_agent` follows Reddit's format: `appname/version by username`
Problem: Too Many Requests or rate limiting
- Solution:
  - Reduce the `limit` parameter in `config.yaml`
  - Add delays between requests
  - Ensure you're not running multiple instances simultaneously
Problem: Subreddit not found
- Solution: Check that the subreddit name in `config.yaml` exists and is spelled correctly (no r/ prefix needed)
Problem: Empty or very few comments collected
- Solution:
  - Try different subreddits with more activity
  - Check if the subreddit allows public access
  - Increase the `limit` parameter for more data
Problem: Poor quality data (too many short comments)
- Solution: Modify the filtering criteria in `preprocess_data.py` to be more selective
- Issue: A warning appeared about `return_all_scores` being deprecated. Fix: removed the argument and relied on `top_k=1` by default.
- Issue: All inputs returned `LABEL_1`. Fix: trained the model longer (6 epochs) and confirmed class balance in the dataset.
- Fix: accessed the list of results correctly using `result = classifier(text)[0]`.
- Fix: uninstalled and reinstalled with `pip install transformers==4.40.0`.
- Fix: mapped `LABEL_1` to `label=1` and `LABEL_0` to `label=0` explicitly.
- Fix: upgraded libraries with `pip install --upgrade numpy datasets transformers`.
- Fix: set `device=0` in `pipeline` and printed `torch.cuda.current_device()`.
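The snippet below ties a few of these fixes together in one hedged example (indexing the pipeline result, mapping `LABEL_*` to numeric labels, and selecting the device explicitly):

```python
# Sketch combining the fixes above: explicit device, result indexing, and label mapping.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1       # fix: set device=0 when a GPU is present
classifier = pipeline(
    "sentiment-analysis",
    model="models/distilbert-finetuned",
    device=device,
)

result = classifier("I loved the movie!")[0]           # fix: index the returned list
label = 1 if result["label"] == "LABEL_1" else 0       # fix: map LABEL_1 -> 1, LABEL_0 -> 0
print(label, round(result["score"], 4))
```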
- [Data Collection Guide](scripts/collect_data.md) - API integration and data sourcing
- [Data Preprocessing Guide](scripts/preprocess_data.md) - Text cleaning and labeling strategies
- [Model Training Guide](scripts/train_model.md) - PyTorch & Hugging Face deep dive
- [Inference Guide](scripts/run_inference.md) - Deployment and prediction strategies
This project shows how simple it can be to build modern NLP applications with Hugging Face and PyTorch. Even with a basic CSV and minimal code, you're using state-of-the-art language models for real-world tasks.

Happy coding! 🤗