
Sentiment Analysis with Hugging Face Transformers and PyTorch

This repository provides a minimal yet realistic example of using Hugging Face Transformers with PyTorch to train and deploy a sentiment analysis model.

The project demonstrates how to:

  • Fine-tune a pre-trained language model (DistilBERT) on a binary sentiment classification task.
  • Use Trainer and TrainingArguments from Hugging Face to abstract away boilerplate code.
  • Leverage PyTorch as the training backend.
  • Run inference using a simple script with the Hugging Face pipeline.

This is an ideal template for beginners who want to understand how Hugging Face and PyTorch work together.


🎯 Workflow Overview

This project implements a complete end-to-end machine learning pipeline, from data collection to model deployment. The diagram below illustrates the entire workflow and architecture:

[Diagram: Sentiment Analysis Workflow]

Key Pipeline Stages:

  1. Data Collection: Gather real-world text data from Reddit using the PRAW API
  2. Data Preprocessing: Clean text and generate pseudo-labels using pre-trained models
  3. Model Training: Fine-tune DistilBERT using PyTorch and Hugging Face Trainer
  4. Inference: Deploy the trained model for sentiment prediction

Architecture Layers:

  • Application Layer: Your Python scripts for each pipeline stage
  • Hugging Face Layer: High-level ML abstractions (AutoTokenizer, Trainer, Pipeline)
  • PyTorch Layer: Core deep learning operations (tensors, autograd, optimization)
  • Hardware Layer: CPU/GPU processing and memory management

This layered approach demonstrates how modern NLP workflows integrate multiple frameworks to achieve both ease of use and powerful functionality.


πŸ“ Project Structure

├── data/
│   └── processed/
│       └── labeled_dataset.csv     # Cleaned dataset with `text` and `label` columns
├── models/
│   └── distilbert-finetuned/       # Output directory created by Trainer; contains trained model
├── scripts/
│   ├── collect_data.py             # Data collection script ([📖 Documentation](scripts/collect_data.md))
│   ├── preprocess_data.py          # Data cleaning and formatting script ([📖 Documentation](scripts/preprocess_data.md))
│   ├── train_model.py              # Script to train the model ([📖 Documentation](scripts/train_model.md))
│   └── run_inference.py            # Script to perform inference on sample texts ([📖 Documentation](scripts/run_inference.md))
├── requirements.txt                # Python dependencies
└── README.md                       # This file

🚀 Quickstart

1. Prepare Environment

python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
pip install -r requirements.txt

2. Setup Reddit API and Configuration

Before collecting data, you need to create a Reddit developer application and configure your API credentials.

A. Create Reddit Developer Application

  1. Go to Reddit Apps: Visit https://www.reddit.com/prefs/apps
  2. Login: Sign in with your Reddit account (create one if needed)
  3. Create New App: Click "Create App" or "Create Another App"
  4. Fill Application Details:
    • Name: sentiment-analysis-bot (or any name you prefer)
    • App type: Select "script" (for personal use scripts)
    • Description: Data collection for sentiment analysis research
    • About URL: Leave blank or add your GitHub repo
    • Redirect URI: http://localhost:8080 (required but not used for scripts)
  5. Create App: Click "Create app"
  6. Get Credentials: Note down your client_id (under the app name) and client_secret

B. Create Configuration File

Create a config.yaml file in the project root with your Reddit API credentials. You can use the provided template:

# Copy the example configuration file
cp config.yaml.example config.yaml

# Edit with your credentials
# Use your preferred text editor (notepad, VSCode, nano, etc.)
notepad config.yaml  # Windows
# or
code config.yaml     # VSCode

Configuration Template (config.yaml):

reddit:
  client_id: "your_client_id_here"          # Found under your app name
  client_secret: "your_client_secret_here"  # The "secret" field
  user_agent: "sentiment-analysis-bot/1.0 by YourUsername"  # Identify your bot
  subreddit: "technology"                   # Subreddit to collect from
  limit: 100                               # Number of comments to collect
  output_path: "data/raw/reddit_comments.json"  # Where to save raw data

How to fill in the values:

  1. client_id: Copy from Reddit app page (appears under the app name)
  2. client_secret: Copy the "secret" field from Reddit app page
  3. user_agent: Replace YourUsername with your Reddit username
  4. subreddit: Choose any public subreddit (try "technology", "movies", "gaming")
  5. limit: Start with 50-100 for testing, increase later for more data

Important Security Notes:

  • ✅ The config.yaml file is already in .gitignore to prevent accidental commits
  • ✅ Never share your client_secret publicly
  • ✅ Use descriptive but respectful user_agent strings
  • ✅ Replace YourUsername with your actual Reddit username
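
For reference, here is a minimal sketch of how a script like collect_data.py might load these credentials and connect to Reddit (assuming PyYAML and PRAW; the actual script's structure may differ):

import praw
import yaml

# Load the reddit section of the project-root config file
with open("config.yaml") as f:
    cfg = yaml.safe_load(f)["reddit"]

reddit = praw.Reddit(
    client_id=cfg["client_id"],
    client_secret=cfg["client_secret"],
    user_agent=cfg["user_agent"],
)

# Collect comment bodies from the configured subreddit
comments = []
for submission in reddit.subreddit(cfg["subreddit"]).hot(limit=25):
    submission.comments.replace_more(limit=0)  # expand "load more" stubs
    comments.extend(c.body for c in submission.comments.list())

print(f"Collected {len(comments[: cfg['limit']])} comments")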

C. Test Your Configuration

# Test data collection (collect a small sample first)
python scripts/collect_data.py

3. Complete Data Pipeline

# Step 1: Collect raw data from Reddit
python scripts/collect_data.py

# Step 2: Clean and label the collected data
python scripts/preprocess_data.py

# Step 3: Train the sentiment analysis model
python scripts/train_model.py

This will:

  • Collect comments from Reddit using your API credentials
  • Clean and automatically label the text data
  • Load the dataset from data/processed/labeled_dataset.csv
  • Tokenize the texts
  • Fine-tune distilbert-base-uncased
  • Save the model to models/distilbert-finetuned

4. Run Inference

python scripts/run_inference.py --text "I loved the movie!" "That was terrible."

Expected output:

📌 'I loved the movie!' → POSITIVE (label=1), score=0.8857
📌 'That was terrible.' → NEGATIVE (label=0), score=0.5499

🧠 What Each File Does

scripts/train_model.py (📚 Detailed Documentation)

  • Hugging Face:

    • AutoTokenizer to tokenize the text.
    • AutoModelForSequenceClassification to load the pre-trained DistilBERT model.
    • Trainer and TrainingArguments to handle training logic.
  • PyTorch:

    • Used behind the scenes by Trainer to handle model parameters, backpropagation, and GPU execution.
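
A minimal sketch of that pattern (illustrative only; the hyperparameters here are assumptions, except the 6 epochs mentioned under the problems section below):

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Load the labeled CSV produced by preprocess_data.py
dataset = load_dataset("csv", data_files="data/processed/labeled_dataset.csv")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="models/distilbert-finetuned",
    num_train_epochs=6,               # longer training fixed the all-POSITIVE issue
    per_device_train_batch_size=16,   # assumed value
)
trainer = Trainer(model=model, args=args, train_dataset=dataset["train"])
trainer.train()
trainer.save_model("models/distilbert-finetuned")
tokenizer.save_pretrained("models/distilbert-finetuned")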

scripts/run_inference.py (📚 Detailed Documentation)

  • Hugging Face:

    • Uses pipeline to wrap the trained model for easy inference.
  • PyTorch:

    • Used internally when the pipeline calls the model on the GPU.
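
A minimal sketch of that inference pattern (the actual script additionally parses the --text arguments shown in the Quickstart):

from transformers import pipeline

# device=0 runs on the first GPU; omit it to stay on CPU
classifier = pipeline("text-classification",
                      model="models/distilbert-finetuned", device=0)

for text in ["I loved the movie!", "That was terrible."]:
    result = classifier(text)[0]  # the pipeline returns a list of dicts
    print(f"'{text}' -> {result['label']} (score={result['score']:.4f})")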

scripts/collect_data.py (📚 Detailed Documentation)

  • Purpose: Collect text data from external APIs (Reddit, Twitter, etc.)
  • Technologies: PRAW for Reddit API, requests for HTTP calls, pandas for data handling
  • Output: Raw data saved to data/raw/ for further processing

scripts/preprocess_data.py (📚 Detailed Documentation)

  • Purpose: Clean and label raw text data for training
  • Technologies: pandas, nltk/spacy for NLP, scikit-learn for data splitting
  • Output: Clean, labeled dataset ready for model training
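
A condensed sketch of that flow, including the pseudo-labeling idea from the pipeline overview (the JSON column name and length threshold are assumptions):

import pandas as pd
from transformers import pipeline

df = pd.read_json("data/raw/reddit_comments.json")   # output of collect_data.py
df["text"] = df["text"].astype(str).str.strip()      # assumes a "text" column
df = df[df["text"].str.len() > 20]                   # drop very short comments

# Generate pseudo-labels with an off-the-shelf sentiment model
labeler = pipeline("sentiment-analysis")
preds = labeler(df["text"].tolist(), truncation=True)
df["label"] = [1 if p["label"] == "POSITIVE" else 0 for p in preds]

df[["text", "label"]].to_csv("data/processed/labeled_dataset.csv", index=False)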

data/processed/labeled_dataset.csv

  • CSV file with columns:

    • text: the input sentence
    • label: the corresponding sentiment label (1 = positive, 0 = negative)
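
For example, the first rows might look like this (using the Quickstart sample sentences):

text,label
"I loved the movie!",1
"That was terrible.",0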

models/distilbert-finetuned/

  • Created after training. Contains:

    • pytorch_model.bin: the trained weights
    • config.json: model architecture config
    • tokenizer_config.json, vocab.txt, tokenizer.json: tokenizer files
    • training_args.bin: the TrainingArguments used

📚 Detailed Documentation

For comprehensive guides on each component of this project, see the detailed documentation files:

  • scripts/collect_data.md — data collection
  • scripts/preprocess_data.md — data cleaning and labeling
  • scripts/train_model.md — model training
  • scripts/run_inference.md — inference

Each guide provides detailed explanations, code examples, best practices, and troubleshooting tips specific to PyTorch and Hugging Face workflows.


οΏ½πŸ—ΊοΈ Next Steps to implement

  1. Data Augmentation

    • Explore techniques to generate more balanced or diverse training data.
  2. Custom Datasets with datasets.Dataset

    • Use Hugging Face's datasets library to load and manipulate large datasets.
  3. Custom Models and Architectures

    • Replace DistilBERT with another model (e.g., BERT, RoBERTa, or your own architecture).
  4. Hyperparameter Optimization

    • Try using Optuna or Ray Tune to find better training settings.
  5. Metrics and Logging

    • Integrate with wandb or tensorboard for live tracking.
  6. Model Deployment

    • Package the trained model as a FastAPI or Flask app.
    • Export to ONNX or use TorchScript for production.
  7. Multi-class or Multi-label Classification

    • Expand from binary to more complex tasks.
  8. Explainability

    • Use libraries like captum or shap to visualize which words influenced the predictions.

βš™οΈ Configuration Troubleshooting

Reddit API Issues

Problem: config.yaml file not found!

  • Solution: Create the config.yaml file in the project root directory with your Reddit API credentials

Problem: InvalidCredentials or Unauthorized errors

  • Solution:
    1. Check that client_id and client_secret are correct
    2. Ensure your Reddit app type is set to "script"
    3. Verify your user_agent follows Reddit's format: appname/version by username

Problem: Too Many Requests or rate limiting

  • Solution:
    1. Reduce the limit parameter in config.yaml
    2. Add delays between requests (see the sketch below)
    3. Ensure you're not running multiple instances simultaneously
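
A tiny sketch of the delay idea (a hypothetical helper, not part of the actual scripts):

import time

def fetch_with_delay(subreddit, seconds=2.0):
    """Yield hot submissions with a pause between requests."""
    for submission in subreddit.hot(limit=25):
        yield submission
        time.sleep(seconds)  # simple client-side rate limiting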

Problem: Subreddit not found

  • Solution: Check that the subreddit name in config.yaml exists and is spelled correctly (no r/ prefix needed)

Data Collection Issues

Problem: Empty or very few comments collected

  • Solution:
    1. Try different subreddits with more activity
    2. Check if the subreddit allows public access
    3. Increase the limit parameter for more data

Problem: Poor quality data (too many short comments)

  • Solution: Modify the filtering criteria in preprocess_data.py to be more selective

🧩 Problems I Faced and How I Solved Them

1. ✅ return_all_scores Deprecation Warning in pipeline

Issue: Warning appeared about return_all_scores being deprecated.

Fix: Removed the argument and relied on the default top_k=1 behavior.
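
If you do want every class score, newer transformers versions use top_k in place of return_all_scores (sketch):

from transformers import pipeline

classifier = pipeline("text-classification", model="models/distilbert-finetuned")

# Deprecated: classifier(text, return_all_scores=True)
scores = classifier("I loved the movie!", top_k=None)  # top_k=None returns all classes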


2. ✅ Unexpected Predictions (Only POSITIVE)

Issue: All inputs returned LABEL_1.

Fix: Trained the model longer (6 epochs) and confirmed class balance in the dataset.


3. ✅ TypeError in run_inference.py (list indices must be integers)

Fix: Accessed the list of results correctly using result = classifier(text)[0].


4. ✅ transformers Version Conflicts

Fix: Uninstalled and reinstalled with:

pip install transformers==4.40.0

5. ✅ Label Mapping Mismatch

Fix: Mapped LABEL_1 to label=1 and LABEL_0 to label=0 explicitly.
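
A minimal sketch of that mapping (label names follow the default LABEL_N convention; the result dict is a made-up example):

label_map = {"LABEL_0": 0, "LABEL_1": 1}      # pipeline output name -> dataset label

result = {"label": "LABEL_1", "score": 0.89}  # example pipeline output
numeric_label = label_map[result["label"]]    # -> 1 (positive)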


6. ✅ NumPy copy=False Migration Error

Fix: Upgraded libraries:

pip install --upgrade numpy datasets transformers

7. ✅ GPU Usage Confirmation

Fix: Set device=0 in pipeline and printed torch.cuda.current_device().
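
A quick way to verify GPU visibility with standard PyTorch calls:

import torch

print(torch.cuda.is_available())        # True if PyTorch can see a GPU
if torch.cuda.is_available():
    print(torch.cuda.current_device())  # index of the active device
    print(torch.cuda.get_device_name(0))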


📚 References and Documentation

Official Documentation

  • Hugging Face Transformers: https://huggingface.co/docs/transformers
  • PyTorch: https://pytorch.org/docs
  • PRAW (Reddit API): https://praw.readthedocs.io

Project-Specific Guides

  • scripts/collect_data.md · scripts/preprocess_data.md · scripts/train_model.md · scripts/run_inference.md


This project shows how simple it can be to build modern NLP applications with Hugging Face and PyTorch. Even with a basic CSV and minimal code, you can apply state-of-the-art language models to real-world tasks.

Happy coding! 🤗🔥
