This repository provides a minimal yet realistic example of using Hugging Face Transformers with PyTorch to train and deploy a sentiment analysis model.
The project demonstrates how to:
- Fine-tune a pre-trained language model (DistilBERT) on a binary sentiment classification task.
- Use `Trainer` and `TrainingArguments` from Hugging Face to abstract away boilerplate code.
- Leverage PyTorch as the training backend.
- Run inference using a simple script with the Hugging Face `pipeline`.
This is an ideal template for beginners who want to understand how Hugging Face and PyTorch work together.
This project implements a complete end-to-end machine learning pipeline, from data collection to model deployment. The diagram below illustrates the entire workflow and architecture:
Key Pipeline Stages:
- Data Collection: Gather real-world text data from Reddit using the PRAW API
- Data Preprocessing: Clean text and generate pseudo-labels using pre-trained models
- Model Training: Fine-tune DistilBERT using PyTorch and Hugging Face Trainer
- Inference: Deploy the trained model for sentiment prediction
Architecture Layers:
- Application Layer: Your Python scripts for each pipeline stage
- Hugging Face Layer: High-level ML abstractions (AutoTokenizer, Trainer, Pipeline)
- PyTorch Layer: Core deep learning operations (tensors, autograd, optimization)
- Hardware Layer: CPU/GPU processing and memory management
This layered approach demonstrates how modern NLP workflows integrate multiple frameworks to achieve both ease of use and powerful functionality.
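To make the layering concrete, here is a minimal sketch (not part of the repository's scripts) showing a Hugging Face `pipeline` sitting on top of a PyTorch model that runs on whatever hardware is available:

```python
# Minimal illustration of the layers: the pipeline (Hugging Face layer) wraps a
# PyTorch nn.Module (PyTorch layer) whose tensors live on the CPU or GPU (hardware layer).
import torch
from transformers import pipeline

classifier = pipeline("sentiment-analysis")            # Hugging Face layer (default model)

print(isinstance(classifier.model, torch.nn.Module))   # True: PyTorch layer underneath
print(next(classifier.model.parameters()).device)      # hardware layer: cpu or cuda

print(classifier("PyTorch and Hugging Face work well together."))  # application layer
```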
```
├── data/
│   └── processed/
│       └── labeled_dataset.csv   # Cleaned dataset with `text` and `label` columns
├── models/
│   └── distilbert-finetuned/     # Output directory created by Trainer; contains the trained model
├── scripts/
│   ├── collect_data.py           # Data collection script (documentation: scripts/collect_data.md)
│   ├── preprocess_data.py        # Data cleaning and formatting script (documentation: scripts/preprocess_data.md)
│   ├── train_model.py            # Script to train the model (documentation: scripts/train_model.md)
│   └── run_inference.py          # Script to perform inference on sample texts (documentation: scripts/run_inference.md)
├── requirements.txt              # Python dependencies
└── README.md                     # This file
```
```bash
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows
pip install -r requirements.txt
```

Before collecting data, you need to create a Reddit developer application and configure your API credentials.
- Go to Reddit Apps: Visit https://www.reddit.com/prefs/apps
- Login: Sign in with your Reddit account (create one if needed)
- Create New App: Click "Create App" or "Create Another App"
- Fill Application Details:
  - Name: `sentiment-analysis-bot` (or any name you prefer)
  - App type: Select "script" (for personal use scripts)
  - Description: `Data collection for sentiment analysis research`
  - About URL: Leave blank or add your GitHub repo
  - Redirect URI: `http://localhost:8080` (required but not used for scripts)
- Create App: Click "Create app"
- Get Credentials: Note down your `client_id` (under the app name) and `client_secret`
Create a `config.yaml` file in the project root with your Reddit API credentials. You can use the provided template:
```bash
# Copy the example configuration file
cp config.yaml.example config.yaml

# Edit with your credentials
# Use your preferred text editor (notepad, VSCode, nano, etc.)
notepad config.yaml   # Windows
# or
code config.yaml      # VSCode
```

Configuration Template (`config.yaml`):
```yaml
reddit:
  client_id: "your_client_id_here"                          # Found under your app name
  client_secret: "your_client_secret_here"                  # The "secret" field
  user_agent: "sentiment-analysis-bot/1.0 by YourUsername"  # Identify your bot
  subreddit: "technology"                                   # Subreddit to collect from
  limit: 100                                                # Number of comments to collect
  output_path: "data/raw/reddit_comments.json"              # Where to save raw data
```

How to fill in the values:
- client_id: Copy from Reddit app page (appears under the app name)
- client_secret: Copy the "secret" field from Reddit app page
- user_agent: Replace `YourUsername` with your Reddit username
- subreddit: Choose any public subreddit (try "technology", "movies", "gaming")
- limit: Start with 50-100 for testing, increase later for more data
Important Security Notes:
- ✅ The `config.yaml` file is already in `.gitignore` to prevent accidental commits
- ✅ Never share your `client_secret` publicly
- ✅ Use descriptive but respectful `user_agent` strings
- ✅ Replace `YourUsername` with your actual Reddit username
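For orientation, the snippet below is a hypothetical sketch of how `collect_data.py` might read `config.yaml` and build a PRAW client; the actual script may structure this differently:

```python
# Hypothetical sketch: load Reddit credentials from config.yaml and create a PRAW client.
import praw
import yaml

with open("config.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)["reddit"]

reddit = praw.Reddit(
    client_id=cfg["client_id"],
    client_secret=cfg["client_secret"],
    user_agent=cfg["user_agent"],
)

print(reddit.read_only)  # True: read-only access is enough for collecting public comments
```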
```bash
# Test data collection (collect a small sample first)
python scripts/collect_data.py
```

```bash
# Step 1: Collect raw data from Reddit
python scripts/collect_data.py

# Step 2: Clean and label the collected data
python scripts/preprocess_data.py

# Step 3: Train the sentiment analysis model
python scripts/train_model.py
```

This will:
- Collect comments from Reddit using your API credentials
- Clean and automatically label the text data
- Load the dataset from `data/processed/labeled_dataset.csv`
- Tokenize the texts (see the sketch after this list)
- Fine-tune `distilbert-base-uncased`
- Save the model to `models/distilbert-finetuned`
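As a rough illustration of the loading and tokenization steps (not the exact code in `train_model.py`), the dataset and tokenizer could be prepared like this:

```python
# Rough sketch: load the labeled CSV and tokenize it into PyTorch tensors.
import pandas as pd
from transformers import AutoTokenizer

df = pd.read_csv("data/processed/labeled_dataset.csv")        # columns: text, label
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

encodings = tokenizer(
    df["text"].tolist(),
    truncation=True,
    padding=True,
    return_tensors="pt",
)
print(encodings["input_ids"].shape)  # (num_examples, max_sequence_length)
```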
```bash
python scripts/run_inference.py --text "I loved the movie!" "That was terrible."
```

Expected output:

```
'I loved the movie!' → POSITIVE (label=1), score=0.8857
'That was terrible.' → NEGATIVE (label=0), score=0.5499
```
`scripts/train_model.py` ([Detailed Documentation](scripts/train_model.md))

- Hugging Face:
  - `AutoTokenizer` to tokenize the text.
  - `AutoModelForSequenceClassification` to load the pre-trained DistilBERT model.
  - `Trainer` and `TrainingArguments` to handle training logic.
- PyTorch:
  - Used behind the scenes by `Trainer` to handle model parameters, backpropagation, and GPU execution.
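The core training setup could look roughly like the sketch below (hyperparameters and dataset handling are illustrative; see the script and its documentation for the real values):

```python
# Hedged sketch of fine-tuning DistilBERT with Trainer and TrainingArguments.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("csv", data_files="data/processed/labeled_dataset.csv")["train"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

args = TrainingArguments(
    output_dir="models/distilbert-finetuned",   # illustrative hyperparameters
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()
trainer.save_model("models/distilbert-finetuned")
```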
`scripts/run_inference.py` ([Detailed Documentation](scripts/run_inference.md))

- Hugging Face:
  - Uses `pipeline` to wrap the trained model for easy inference.
- PyTorch:
  - Used internally when the pipeline calls the model on the GPU.
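A minimal sketch of this pattern (argument parsing and output formatting in the real script may differ):

```python
# Illustrative inference sketch: wrap the fine-tuned checkpoint with pipeline.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="models/distilbert-finetuned",
    tokenizer="models/distilbert-finetuned",
)

for text in ["I loved the movie!", "That was terrible."]:
    result = classifier(text)[0]      # e.g. {"label": "LABEL_1", "score": 0.88}
    print(f"{text!r} -> {result['label']} (score={result['score']:.4f})")
```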
`scripts/collect_data.py` ([Detailed Documentation](scripts/collect_data.md))
- Purpose: Collect text data from external APIs (Reddit, Twitter, etc.)
- Technologies: PRAW for Reddit API, requests for HTTP calls, pandas for data handling
- Output: Raw data saved to `data/raw/` for further processing
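A hypothetical collection helper, assuming the PRAW client and config dictionary from the configuration sketch earlier (the real script may filter and store comments differently):

```python
# Hypothetical helper: pull recent comments from the configured subreddit and save as JSON.
import json

import praw

def collect_comments(reddit: praw.Reddit, cfg: dict) -> list[dict]:
    comments = []
    for comment in reddit.subreddit(cfg["subreddit"]).comments(limit=cfg["limit"]):
        comments.append({"id": comment.id, "body": comment.body, "score": comment.score})
    with open(cfg["output_path"], "w", encoding="utf-8") as f:
        json.dump(comments, f, ensure_ascii=False, indent=2)
    return comments
```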
`scripts/preprocess_data.py` ([Detailed Documentation](scripts/preprocess_data.md))
- Purpose: Clean and label raw text data for training
- Technologies: pandas, nltk/spacy for NLP, scikit-learn for data splitting
- Output: Clean, labeled dataset ready for model training
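One possible shape of the cleaning and pseudo-labeling step, sketched with an off-the-shelf sentiment pipeline (the actual filtering rules and labeling model in `preprocess_data.py` may differ):

```python
# Sketch: clean raw Reddit comments and pseudo-label them with a pre-trained model.
import json
import re

import pandas as pd
from transformers import pipeline

with open("data/raw/reddit_comments.json", encoding="utf-8") as f:
    raw = json.load(f)

# Basic cleaning: strip URLs/whitespace and drop very short comments
texts = []
for item in raw:
    text = re.sub(r"http\S+", "", item["body"]).strip()
    if len(text.split()) >= 5:
        texts.append(text)

# Pseudo-labels from an off-the-shelf sentiment model (1 = positive, 0 = negative)
labeler = pipeline("sentiment-analysis")
labels = [1 if r["label"] == "POSITIVE" else 0 for r in labeler(texts, truncation=True)]

pd.DataFrame({"text": texts, "label": labels}).to_csv(
    "data/processed/labeled_dataset.csv", index=False
)
```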
- `data/processed/labeled_dataset.csv`: CSV file with columns:
  - `text`: the input sentence
  - `label`: the corresponding sentiment label (1 = positive, 0 = negative)
- `models/distilbert-finetuned/`: created after training. Contains:
  - `pytorch_model.bin`: the trained weights
  - `config.json`: model architecture config
  - `tokenizer_config.json`, `vocab.txt`, `tokenizer.json`: tokenizer files
  - `training_args.bin`: the TrainingArguments used
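As a quick sanity check (a sketch, not a repository script), those files are enough to reload the fine-tuned model and tokenizer:

```python
# Sketch: reload the saved checkpoint and run one example through it.
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("models/distilbert-finetuned")
tokenizer = AutoTokenizer.from_pretrained("models/distilbert-finetuned")

inputs = tokenizer("Loading the saved checkpoint works.", return_tensors="pt")
print(model(**inputs).logits.shape)  # torch.Size([1, 2]): one row of logits, two classes
```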
For comprehensive guides on each component of this project, see the detailed documentation files:
- [Data Collection Guide](scripts/collect_data.md) - Complete guide for collecting data from Reddit, Twitter, and other sources
- [Data Preprocessing Guide](scripts/preprocess_data.md) - In-depth data cleaning, labeling, and preparation techniques
- [Model Training Guide](scripts/train_model.md) - Deep dive into PyTorch & Hugging Face integration for training
- [Inference & Deployment Guide](scripts/run_inference.md) - Production-ready inference patterns and deployment strategies
Each guide provides detailed explanations, code examples, best practices, and troubleshooting tips specific to PyTorch and Hugging Face workflows.
- Data Augmentation
  - Explore techniques to generate more balanced or diverse training data.
- Custom Datasets with `datasets.Dataset`
  - Use Hugging Face's `datasets` library to load and manipulate large datasets.
- Custom Models and Architectures
  - Replace DistilBERT with another model (e.g., BERT, RoBERTa, or your own architecture).
- Hyperparameter Optimization
  - Try using Optuna or Ray Tune to find better training settings.
- Metrics and Logging
  - Integrate with `wandb` or `tensorboard` for live tracking.
- Model Deployment
  - Package the trained model as a FastAPI or Flask app (a minimal sketch appears after this list).
  - Export as ONNX or use `torchscript` for production.
- Multi-class or Multi-label Classification
  - Expand from binary to more complex tasks.
- Explainability
  - Use libraries like `captum` or `shap` to visualize which words influenced the predictions.
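For the Model Deployment idea, a minimal FastAPI sketch could look like this (assumes `fastapi` and `uvicorn` are installed; endpoint and payload names are illustrative):

```python
# Illustrative FastAPI wrapper around the fine-tuned sentiment model.
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline(
    "sentiment-analysis",
    model="models/distilbert-finetuned",
    tokenizer="models/distilbert-finetuned",
)

class SentimentRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(request: SentimentRequest):
    result = classifier(request.text)[0]
    return {"label": result["label"], "score": result["score"]}

# Run with: uvicorn app:app --reload   (assuming this sketch is saved as app.py)
```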
Problem: config.yaml file not found!
- Solution: Create the `config.yaml` file in the project root directory with your Reddit API credentials
Problem: InvalidCredentials or Unauthorized errors
- Solution:
  - Check that `client_id` and `client_secret` are correct
  - Ensure your Reddit app type is set to "script"
  - Verify your `user_agent` follows Reddit's format: `appname/version by username`
Problem: Too Many Requests or rate limiting
- Solution:
  - Reduce the `limit` parameter in `config.yaml`
  - Add delays between requests
  - Ensure you're not running multiple instances simultaneously
Problem: Subreddit not found
- Solution: Check that the subreddit name in `config.yaml` exists and is spelled correctly (no r/ prefix needed)
Problem: Empty or very few comments collected
- Solution:
  - Try different subreddits with more activity
  - Check if the subreddit allows public access
  - Increase the `limit` parameter for more data
Problem: Poor quality data (too many short comments)
- Solution: Modify the filtering criteria in `preprocess_data.py` to be more selective
- Issue: A warning appeared about `return_all_scores` being deprecated. Fix: removed the argument and relied on `top_k=1` by default.
- Issue: All inputs returned `LABEL_1`. Fix: trained the model longer (6 epochs) and confirmed class balance in the dataset.
- Fix: accessed the list of results correctly using `result = classifier(text)[0]`.
- Fix: uninstalled and reinstalled with `pip install transformers==4.40.0`.
- Fix: mapped `LABEL_1` to `label=1` and `LABEL_0` to `label=0` explicitly.
- Fix: upgraded libraries with `pip install --upgrade numpy datasets transformers`.
- Fix: set `device=0` in `pipeline` and printed `torch.cuda.current_device()`.
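The snippet below ties a few of these fixes together in one hedged example (indexing the pipeline result, mapping `LABEL_*` to numeric labels, and selecting the device explicitly):

```python
# Sketch combining the fixes above: explicit device, result indexing, and label mapping.
import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1       # fix: set device=0 when a GPU is present
classifier = pipeline(
    "sentiment-analysis",
    model="models/distilbert-finetuned",
    device=device,
)

result = classifier("I loved the movie!")[0]           # fix: index the returned list
label = 1 if result["label"] == "LABEL_1" else 0       # fix: map LABEL_1 -> 1, LABEL_0 -> 0
print(label, round(result["score"], 4))
```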
- [Data Collection Guide](scripts/collect_data.md) - API integration and data sourcing
- [Data Preprocessing Guide](scripts/preprocess_data.md) - Text cleaning and labeling strategies
- [Model Training Guide](scripts/train_model.md) - PyTorch & Hugging Face deep dive
- [Inference Guide](scripts/run_inference.md) - Deployment and prediction strategies
This project shows how simple it can be to build modern NLP applications with Hugging Face and PyTorch. Even with a basic CSV and minimal code, you're using state-of-the-art language models for real-world tasks.

Happy coding! 🤗