# Technical Assignment - Week 2
- GitHub repo: https://github.com/jliima/movie-review-sentiment-analyzer
- Demo video: https://www.youtube.com/watch?v=Q_YuPJ5Y2OU
- Hugging Face Model: https://huggingface.co/hieroja/custom-sentiment-model

## Part 1: Dataset Preparation and Fine-Tuning

### Step 1: Download the IMDB Dataset
1. Use the IMDB dataset from Kaggle: ```/kaggle/input/imdb-dataset/IMDB Dataset.csv```.
2. Load the dataset using Pandas and verify it in your notebook.
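
One way to carry out the verification in item 2 is to check the header, the label values, and the class balance before moving on. The sketch below does this with the standard-library `csv` module on a tiny in-memory sample so that it runs anywhere; the sample rows and the `summarize_reviews` helper are hypothetical and only mirror the two columns of the real file.

```python
import csv
import io
from collections import Counter

# Tiny stand-in for 'IMDB Dataset.csv' (hypothetical rows, same two columns).
SAMPLE_CSV = '''review,sentiment
"Great film.",positive
"Terrible plot.",negative
"Loved it.",positive
'''

def summarize_reviews(csv_file):
    """Return (column names, label counts, number of empty reviews)."""
    reader = csv.DictReader(csv_file)
    label_counts = Counter()
    empty_reviews = 0
    for row in reader:
        label_counts[row['sentiment']] += 1
        if not row['review'].strip():
            empty_reviews += 1
    return reader.fieldnames, label_counts, empty_reviews

columns, label_counts, empty_reviews = summarize_reviews(io.StringIO(SAMPLE_CSV))
print(columns, dict(label_counts), empty_reviews)
```

With pandas, comparable checks would be `df.columns`, `df['sentiment'].value_counts()`, and `df['review'].isna().sum()`.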

In [1]:
import pandas as pd

!kaggle datasets download -d 'lakshmi25npathi/imdb-dataset-of-50k-movie-reviews'
!unzip -n 'imdb-dataset-of-50k-movie-reviews.zip' -d './data/'

df = pd.read_csv('./data/IMDB Dataset.csv')

df.head()

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)
Archive:  imdb-dataset-of-50k-movie-reviews.zip


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | A wonderful little production. <br /><br />The... | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


### Step 2: Data Preprocessing
1. Clean and preprocess the dataset:
    - Encode the sentiment column (positive -> 1, negative -> 0).
    - Retain only the review and label columns.
2. Split the data into training, validation, and test sets.
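
The reviews printed by `df.head()` contain HTML line breaks (`<br />`). As an optional cleaning step, these could be stripped before tokenization. A minimal sketch, assuming the hypothetical helpers `clean_review` and `encode_sentiment` (neither is part of the assignment code):

```python
import re

LABELS = {'positive': 1, 'negative': 0}

def clean_review(text):
    """Replace <br> variants with a space and collapse repeated whitespace."""
    without_breaks = re.sub(r'<br\s*/?>', ' ', text, flags=re.IGNORECASE)
    return re.sub(r'\s+', ' ', without_breaks).strip()

def encode_sentiment(label):
    """Map 'positive' -> 1 and 'negative' -> 0; raise KeyError on anything else."""
    return LABELS[label]

print(clean_review('A wonderful little production. <br /><br />The filming'))
print(encode_sentiment('positive'))
```

Raising on an unknown label, rather than mapping it to `NaN` as `Series.map` does, makes unexpected values visible immediately.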

In [2]:
from sklearn.model_selection import train_test_split

dfProcessed = df.copy()
dfProcessed['sentiment'] = dfProcessed['sentiment'].map({'positive': 1, 'negative': 0})
dfProcessed = dfProcessed[['review', 'sentiment']]

trainDf, tempDf = train_test_split(dfProcessed, test_size=0.2, random_state=42, stratify=dfProcessed['sentiment'])

testPositive = tempDf[tempDf['sentiment'] == 1].sample(n=1, random_state=42)
testNegative = tempDf[tempDf['sentiment'] == 0].sample(n=1, random_state=42)
testDf = pd.concat([testPositive, testNegative])

valDf = tempDf.drop(testDf.index)

print('trainDf.shape:', trainDf.shape)
print('valDf.shape:', valDf.shape)
print('testDf.shape:', testDf.shape)

trainDf.shape: (40000, 2)
valDf.shape: (9998, 2)
testDf.shape: (2, 2)
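
The shapes above follow from the split: 20% of the rows are held out, two of them (one per class) become the test set, and the remaining 9,998 form the validation set. The `stratify=` argument keeps the positive/negative ratio the same on both sides of the split. The pure-Python sketch below illustrates that idea on a list of labels; it is not scikit-learn's implementation, and `stratified_split` is a hypothetical helper.

```python
import random

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train_indices, test_indices = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return sorted(train_indices), sorted(test_indices)

labels = [1] * 50 + [0] * 50
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
print(len(train_idx), len(test_idx))        # 80 20
print(sum(labels[i] for i in test_idx))     # 10 positives in the held-out part
```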


### Step 3: Model Selection and Tokenization
1. Select a pre-trained Hugging Face transformer model for fine-tuning (e.g., distilbert-base-uncased).
2. Tokenize the dataset with the following settings (check whether each is required):
    - Truncation.
    - Padding.
    - Maximum sequence length of 256.
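
Truncation and padding together give every example the same length, which is what allows reviews to be stacked into batches. The sketch below shows the idea on plain lists of token ids. It ignores the special tokens that a real tokenizer inserts, the ids are arbitrary example values, and the pad id of 0 is an assumption for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                  # truncation
    n_padding = max_length - len(kept)
    input_ids = kept + [pad_id] * n_padding        # padding on the right
    attention_mask = [1] * len(kept) + [0] * n_padding
    return input_ids, attention_mask

ids, mask = pad_or_truncate([11, 22, 33, 44], max_length=6)
print(ids)   # [11, 22, 33, 44, 0, 0]
print(mask)  # [1, 1, 1, 1, 0, 0]
```

The attention mask tells the model which positions hold real tokens, so padded positions do not influence the prediction.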

In [3]:
from transformers import AutoTokenizer
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

def tokenizeFunction(example):
  return tokenizer(example['review'], padding='max_length', truncation=True, max_length=256)

trainDataset = Dataset.from_pandas(trainDf)
valDataset = Dataset.from_pandas(valDf)
testDataset = Dataset.from_pandas(testDf)

trainDataset = trainDataset.map(tokenizeFunction, batched=True)
valDataset = valDataset.map(tokenizeFunction, batched=True)
testDataset = testDataset.map(tokenizeFunction, batched=True)

trainDataset = trainDataset.rename_column('sentiment', 'labels')
valDataset = valDataset.rename_column('sentiment', 'labels')
testDataset = testDataset.rename_column('sentiment', 'labels')


### Step 4: Fine-Tune the Model
1. Fine-tune the model on the IMDB dataset for 2 epochs using the Hugging Face Trainer.
2. Set training parameters:
    - Learning rate: 5e-5 or your own
    - Batch size: 16 or 32
    - Evaluation at the end of each epoch.
3. Ensure that metrics like accuracy, precision, recall, and F1-score are logged during training.
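
For reference, the four metrics in item 3 can be written out directly from the confusion counts. The sketch below treats label 1 (positive) as the positive class; `binary_metrics` is a hypothetical helper, and scikit-learn's metric functions remain the more robust choice in practice.

```python
def binary_metrics(labels, predictions):
    """Compute accuracy, precision, recall and F1 with 1 as the positive class."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

metrics = binary_metrics(labels=[1, 1, 0, 0], predictions=[1, 0, 0, 1])
print(metrics)
```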

In [4]:
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

def computeMetrics(evalPred):
  logits, labels = evalPred
  predictions = np.argmax(logits, axis=-1)

  accuracy = accuracy_score(labels, predictions)
  precision = precision_score(labels, predictions)
  recall = recall_score(labels, predictions)
  f1 = f1_score(labels, predictions)

  return {
    'accuracy': accuracy, 
    'precision': precision, 
    'recall': recall, 
    'f1': f1
  }

model = AutoModelForSequenceClassification.from_pretrained(
  'distilbert-base-uncased', 
  num_labels=2
)

trainingArgs = TrainingArguments(
  output_dir='./results',
  num_train_epochs=2,
  per_device_train_batch_size=16,
  per_device_eval_batch_size=16,
  learning_rate=5e-5,
  eval_strategy='epoch',
  save_strategy='epoch',
  logging_strategy='epoch',
  logging_dir='./logs'
)

trainer = Trainer(
  model=model,
  args=trainingArgs,
  train_dataset=trainDataset,
  eval_dataset=valDataset,
  compute_metrics=computeMetrics
)

trainer.train(resume_from_checkpoint=False)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[2025-02-11 23:07:29,952] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)


/usr/bin/ld: cannot find -laio: No such file or directory
collect2: error: ld returned 1 exit status


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.283 | 0.221941 | 0.915283 | 0.903734 | 0.929586 | 0.916478 |
| 2 | 0.1437 | 0.269158 | 0.919984 | 0.917645 | 0.922785 | 0.920207 |
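
The F1 column can be cross-checked against precision and recall, since F1 is their harmonic mean. A small check using the logged values (agreement is only expected up to the rounding of those values):

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) per epoch, copied from the training log.
epochs = {
    1: (0.903734, 0.929586, 0.916478),
    2: (0.917645, 0.922785, 0.920207),
}

for epoch, (precision, recall, reported) in epochs.items():
    computed = f1_from(precision, recall)
    print(epoch, round(computed, 4), abs(computed - reported) < 1e-5)
```

Validation loss rises from 0.221941 to 0.269158 between the two epochs while accuracy moves only from 0.915283 to 0.919984, which may hint at the onset of overfitting.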


TrainOutput(global_step=5000, training_loss=0.21331258544921874, metrics={'train_runtime': 627.0555, 'train_samples_per_second': 127.58, 'train_steps_per_second': 7.974, 'total_flos': 5298695946240000.0, 'train_loss': 0.21331258544921874, 'epoch': 2.0})
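
The `global_step` and throughput values in `TrainOutput` follow from the training configuration. A short arithmetic check, assuming a single device (so the effective batch size equals `per_device_train_batch_size`) and no gradient accumulation:

```python
import math

train_examples = 40000      # trainDf.shape[0]
batch_size = 16             # per_device_train_batch_size
num_epochs = 2              # num_train_epochs
train_runtime = 627.0555    # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = round(train_examples * num_epochs / train_runtime, 2)
steps_per_second = round(total_steps / train_runtime, 3)

print(total_steps)          # 5000
print(samples_per_second)   # 127.58
print(steps_per_second)     # 7.974
```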

### Step 5: Save and Upload the Model to Hugging Face
1. Save the fine-tuned model and tokenizer locally using save_pretrained().
2. Log in to Hugging Face using notebook_login.
3. Upload the model to Hugging Face using push_to_hub.
4. Verify the model on Hugging Face Hub and include the link in your notebook.
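
Because `os.getenv` returns `None` for a missing variable, a typo in the `.env` file would silently produce a repository id such as `None/custom-sentiment-model`. A minimal fail-fast sketch; the `require_env` and `build_repo_id` helpers are hypothetical, and the example username is a placeholder:

```python
import os

def require_env(name, environ=os.environ):
    """Return the value of an environment variable or raise a clear error."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(f'Missing required environment variable: {name}')
    return value

def build_repo_id(username, model_name='custom-sentiment-model'):
    """Combine a Hub username and a model name into a repository id."""
    return f'{username}/{model_name}'

# Example with an explicit mapping instead of the real environment.
fake_env = {'HF_USERNAME': 'example-user'}
repo_id = build_repo_id(require_env('HF_USERNAME', fake_env))
print(repo_id)  # example-user/custom-sentiment-model
```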

In [5]:
import os
from dotenv import load_dotenv

load_dotenv()

hfToken = os.getenv('HF_API_KEY')
hfUsername = os.getenv('HF_USERNAME')

model.save_pretrained('./custom-sentiment-model')
tokenizer.save_pretrained('./custom-sentiment-model')

model.push_to_hub(f'{hfUsername}/custom-sentiment-model', token=hfToken)
tokenizer.push_to_hub(f'{hfUsername}/custom-sentiment-model', token=hfToken)


No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/hieroja/custom-sentiment-model/commit/f6b6a101e34329a510b9adbed581c5edfa36c8e8', commit_message='Upload tokenizer', commit_description='', oid='f6b6a101e34329a510b9adbed581c5edfa36c8e8', pr_url=None, repo_url=RepoUrl('https://huggingface.co/hieroja/custom-sentiment-model', endpoint='https://huggingface.co', repo_type='model', repo_id='hieroja/custom-sentiment-model'), pr_revision=None, pr_num=None)

## Part 2: API Development and Testing

### Steps 6, 7, and 9
#### **<span style="color:#00FF00">SEE main.py for implementations!</span>**

#### Step 6: Set Up the Backend API
1. Use FastAPI, Flask, Express, or NestJS (Node.js) to create an API.
2. Define a POST endpoint (```/analyze/```) that:
    - Accepts:
      - ```text```: The input text for sentiment analysis.
      - ```model```: A parameter specifying the model to use (```custom``` or ```llama```).
    - Returns:
      - Sentiment (```positive``` or ```negative```).
      - Confidence score.
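
The request and response contract above can be pinned down independently of the web framework. The sketch below validates a request body and builds a response using only the standard library. It is not the code in `main.py`; the helper names are hypothetical, and treating confidence as a number in [0, 1] is an assumption.

```python
ALLOWED_MODELS = {'custom', 'llama'}
ALLOWED_SENTIMENTS = {'positive', 'negative'}

def validate_request(payload):
    """Return (text, model) from a request body or raise ValueError."""
    text = payload.get('text')
    model = payload.get('model')
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")
    if not isinstance(model, str) or model not in ALLOWED_MODELS:
        raise ValueError("'model' must be 'custom' or 'llama'")
    return text, model

def build_response(sentiment, confidence):
    """Build the response body, checking the assumed value ranges."""
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f'unexpected sentiment: {sentiment!r}')
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f'confidence out of range: {confidence}')
    return {'sentiment': sentiment, 'confidence': confidence}

print(validate_request({'text': 'This movie was fantastic', 'model': 'custom'}))
print(build_response('positive', 0.98))
```

Rejecting an unknown `model` value up front keeps the error close to its cause instead of surfacing later as a failed model lookup.
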
#### Step 7: Load Models
1. Load the fine-tuned model from Hugging Face.
2. Access the Llama 3 model using the Groq Cloud API.
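
Once the fine-tuned classifier is loaded, its two output logits still have to be turned into the sentiment label and confidence score that the endpoint returns. One common approach is a softmax over the logits. The sketch below assumes the label order from Step 2 (index 0 = negative, index 1 = positive) and uses made-up logits; it may differ from what `main.py` does.

```python
import math

ID_TO_LABEL = {0: 'negative', 1: 'positive'}  # matches the encoding in Step 2

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def logits_to_result(logits):
    """Pick the most probable class and report its probability as confidence."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return {'sentiment': ID_TO_LABEL[best], 'confidence': probabilities[best]}

result = logits_to_result([-1.2, 2.3])  # hypothetical logits
print(result)
```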

#### Step 9: Define the Llama 3 Prompt (1 point)
1. Write a clear and reusable prompt for the Llama 3 model in Groq Cloud.

  Example (can be improved further):

  "Classify the sentiment of this text as positive or negative:
  'This movie was fantastic'"
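
One way to make the prompt reusable, and to obtain a confidence score from the LLM, is to ask for a small JSON object and parse the reply defensively. The wording, the `build_prompt` and `parse_reply` helpers, and the JSON schema below are illustrative assumptions; the call to the Groq API itself is omitted.

```python
import json

PROMPT_TEMPLATE = (
    'Classify the sentiment of the movie review below as positive or negative.\n'
    'Respond with JSON only, in the form '
    '{{"sentiment": "positive" | "negative", "confidence": <number between 0 and 1>}}.\n\n'
    'Review: """{text}"""'
)

def build_prompt(text):
    """Insert the review text into the reusable template."""
    return PROMPT_TEMPLATE.format(text=text)

def parse_reply(reply):
    """Extract sentiment and confidence from the model's reply text."""
    start, end = reply.find('{'), reply.rfind('}')
    if start == -1 or end == -1:
        raise ValueError('no JSON object found in reply')
    data = json.loads(reply[start:end + 1])
    sentiment = str(data['sentiment']).lower()
    if sentiment not in ('positive', 'negative'):
        raise ValueError(f'unexpected sentiment: {sentiment!r}')
    return {'sentiment': sentiment, 'confidence': float(data['confidence'])}

print(build_prompt('This movie was fantastic'))
print(parse_reply('Sure! {"sentiment": "Positive", "confidence": 0.97}'))
```

Note that a confidence value produced this way is self-reported by the LLM and is not a calibrated probability.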



In [None]:
import os
import time
import requests
import subprocess
import psutil

API_URL = 'http://127.0.0.1:8000/analyze/'
SERVER_CMD = ['python', './backend/main.py']

def killExistingServer():
  # 'cmdline' is a list of arguments, so check each argument for the script name.
  for proc in psutil.process_iter(attrs=['pid', 'cmdline']):
    cmdline = proc.info['cmdline'] or []
    if any('main.py' in arg for arg in cmdline):
      print(f"Stopping existing server (PID: {proc.info['pid']})...")
      proc.kill()
      time.sleep(1)

def isServerRunning():
  # Use the /docs page as a simple health check; a timeout avoids hanging forever.
  try:
    response = requests.get(API_URL.replace('/analyze/', '/docs'), timeout=5)
    return response.status_code == 200
  except requests.RequestException:
    return False

def startServer():
  print('Starting API server...')
  subprocess.Popen(SERVER_CMD, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
  time.sleep(2)

  for _ in range(10):
    if isServerRunning():
      print('API server up and running!')
      return
    time.sleep(1)

  raise RuntimeError('Failed to start API server.')


if isServerRunning():
  print('API server is already running')
else:
  startServer()


API server is already running


### Steps 8 and 10
#### Step 8: Test the API Locally
1. Test the /analyze/ endpoint with both models (custom and llama) using:
    - Postman.
    - curl.
    - Python requests.
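
All three tools send the same JSON body to the endpoint. As a sketch, the request can be built with the standard library alone, which makes its shape easy to inspect before anything is sent; `build_analyze_request` is a hypothetical helper, and the URL assumes the local server used in this notebook.

```python
import json
import urllib.request

API_URL = 'http://127.0.0.1:8000/analyze/'

def build_analyze_request(text, model, url=API_URL):
    """Create (but do not send) a POST request for the /analyze/ endpoint."""
    body = json.dumps({'text': text, 'model': model}).encode('utf-8')
    return urllib.request.Request(
        url,
        data=body,
        headers={'Content-Type': 'application/json'},
        method='POST',
    )

for model_name in ('custom', 'llama'):
    request = build_analyze_request('This movie was fantastic', model_name)
    print(request.get_method(), request.full_url, request.data.decode('utf-8'))
    # To actually send it: urllib.request.urlopen(request, timeout=30)
```

The equivalent curl call would be `curl -X POST http://127.0.0.1:8000/analyze/ -H "Content-Type: application/json" -d '{"text": "This movie was fantastic", "model": "custom"}'`.
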
#### Step 10: Test with Both Models
1. Verify that the API works for both the fine-tuned model and the Llama 3 model.
2. Ensure the results return the sentiment score too.
3. For Groq, you can request the score in the prompt.

In [8]:
import requests

labelMapping = {1: 'positive', 0: 'negative'}



for i in range(2):
  example = testDataset[i]
  reviewText = example['review']
  expectedSentiment = labelMapping[example['labels']]
  
  # Use modelName as the loop variable so the fine-tuned `model` object from In [4] is not overwritten.
  for modelName in ['custom', 'llama']:

    response = requests.post(
      url='http://localhost:8000/analyze/', 
      json={
        'text': reviewText, 
        'model': modelName
      },
      timeout=60
    )
    response.raise_for_status()  # fail loudly on HTTP errors instead of parsing an error body

    result = response.json()

    sentiment = result.get('sentiment')
    confidence = result.get('confidence')
    
    print('Test case (model: {}):'.format(modelName))
    print('Review text: {}\n'.format(reviewText))
    print('Expected sentiment: {}'.format(expectedSentiment))
    print('Response sentiment: {} (confidence {})\n---------------------------------------------------------------\n'.format(sentiment, confidence))

Test case (model: custom):
Review text: Many believe this movie is a baseball movie. Such people are disappointed because it's about a baseball player, but the movie isn't about baseball.<br /><br />Some think this movie is a romantic comedy and are disappointed because the relationship isn't really developed. This movie is not a romantic comedy.<br /><br />This movie is about culture. An arrogant American Major Leaguer and a stern traditional Japanese baseball manager cannot succeed because they can't, indeed, won't understand one another. It's after they manage to break through the cultural barrier that they have success. The ballplayer becomes more Japanese in his team mentality and the manager more American in allowing individual achievement, and they meet in the middle.<br /><br />Baseball and the romance is subordinate to this critique of the two cultures. Many who have no understanding of the Japanese mindset miss this and think it's a movie on baseball or romance and see the cu