# Fake Review Detection - Proof of Concept
### Revision: 6 | Last Updated: 2025-07-14 13:54

This notebook demonstrates a simple proof of concept for fake review detection using a pretrained transformer model. We'll use a small dataset of reviews to showcase the basic functionality.

## 1. Setup and Dependencies

This notebook is designed to run on SageMaker Distribution 3.2.0, which comes with most required packages pre-installed.

In [None]:
# Verify NLTK resources are available
import nltk

# Resource lookup paths mapped to their download names.
# punkt_tab is specifically needed for tokenization.
nltk_resources = {
    'tokenizers/punkt': 'punkt',
    'corpora/stopwords': 'stopwords',
    'tokenizers/punkt_tab': 'punkt_tab',
}

# Check if each resource exists, download only if needed
for resource_path, resource_name in nltk_resources.items():
    try:
        nltk.data.find(resource_path)
        print(f"NLTK {resource_name} resource is available")
    except LookupError:
        print(f"Downloading NLTK {resource_name} resource...")
        nltk.download(resource_name)

In [None]:
# Import necessary libraries
import sys
import os
import numpy as np
import pandas as pd
import torch
import requests
import json
from datasets import load_dataset, Dataset
from sklearn.model_selection import train_test_split
from transformers import TrainingArguments, Trainer
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import matplotlib.pyplot as plt

# Check transformers version
import transformers
print(f"Transformers version: {transformers.__version__}")

# Add the src directory to the path
# Get the absolute path to ensure imports work correctly
notebook_dir = os.path.abspath('')
project_root = os.path.abspath(os.path.join(notebook_dir, '..'))
sys.path.append(project_root)

# Verify the path is correct
print(f"Project root directory: {project_root}")
print(f"Files in project root: {os.listdir(project_root)}")

# Import project modules
from src.preprocessing import preprocess_text
from src.model import load_pretrained_model, predict
from src.evaluation import evaluate_model, plot_confusion_matrix

## 2. Load and Prepare Data

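For this PoC, we load a 1,000-review sample of the IMDB dataset and, if that fails, fall back to a small synthetic dataset built in the notebook. We avoid the amazon_reviews_multi dataset because of potential checksum issues. Note that IMDB carries sentiment labels, not fake/genuine labels, so it serves only as a stand-in to exercise the pipeline.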

In [None]:
# Option 1: Use a different dataset that's less likely to have checksum issues
try:
    # Try loading IMDB dataset (more stable and commonly used).
    # Shuffle before sampling: the IMDB train split is grouped by label,
    # so the slice "train[:1000]" would return a single class.
    dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(1000))
    print("Successfully loaded IMDB dataset")
    
    # Map the IMDB dataset structure to match our expected format
    # IMDB has 'text' and 'label' fields (0=negative, 1=positive)
    # We'll map this to 'review_body' and 'label' (0=genuine, 1=fake)
    # For demonstration, positive reviews stand in for genuine and negative for fake,
    # so the IMDB labels are flipped (1 - label)
    dataset = dataset.rename_column("text", "review_body")
    dataset = dataset.map(lambda example: {"label": 1 - example["label"]})
    
except Exception as e:
    print(f"Error loading IMDB dataset: {e}")
    print("Falling back to creating a sample dataset manually")
    
    # Option 2: Create a small sample dataset manually
    reviews = [
        {"review_body": "This product is amazing! I've been using it for a month and it has completely changed my life. The quality is outstanding and it's worth every penny.", "label": 0},
        {"review_body": "I bought this yesterday and it's already broken. Terrible quality and customer service didn't help at all. Complete waste of money.", "label": 1},
        {"review_body": "Best purchase ever!!! Five stars!!! Amazing product!!! Buy it now!!! You won't regret it!!!", "label": 1},
        {"review_body": "The product arrived on time and works as expected. Good value for the price. I would recommend it to others looking for a budget option.", "label": 0},
        {"review_body": "I've had this product for about 6 months now and it's holding up well. No complaints and does exactly what it's supposed to do.", "label": 0},
        {"review_body": "DO NOT BUY THIS!!! It's a complete scam! The seller is dishonest and the product is nothing like described!!!", "label": 1},
        {"review_body": "Average product, nothing special but gets the job done. Packaging was nice and delivery was quick.", "label": 0},
        {"review_body": "I was skeptical at first but this product exceeded my expectations. The design is elegant and functionality is top-notch.", "label": 0},
        {"review_body": "This changed my life!!! I can't believe how amazing this is!!! Everyone needs to buy this right now!!!", "label": 1},
        {"review_body": "Product arrived damaged and when I tried to return it, customer service was unhelpful. Very disappointed with this purchase.", "label": 1}
    ]
    
    # Create more synthetic examples (seeded so the fallback dataset is reproducible)
    import random
    random.seed(42)
    genuine_phrases = [
        "works well", "good quality", "as described", "reasonable price", "satisfied with purchase",
        "would recommend", "good value", "fast shipping", "easy to use", "well made"
    ]
    
    fake_phrases = [
        "amazing!!!", "life changing!!!", "best ever!!!", "miracle product!!!", "can't believe it!!!",
        "worst product ever", "complete scam", "total ripoff", "don't waste your money", "absolutely terrible"
    ]
    
    # Generate additional examples
    for _ in range(90):
        # Generate genuine reviews (ending with a period or 1-2 exclamation points)
        ending = random.choice([".", "!", "!!"])
        genuine = "I purchased this product last month. " + random.choice(genuine_phrases).capitalize() + \
                 ". " + random.choice(genuine_phrases).capitalize() + \
                 ". Overall " + random.choice(genuine_phrases) + ending
        
        # Generate fake reviews (with lots of exclamation points)
        fake = random.choice(fake_phrases).upper() + "!!! " + \
               random.choice(fake_phrases).capitalize() + "!!! " + \
               random.choice(fake_phrases).capitalize() + "!!!"
        
        reviews.append({"review_body": genuine, "label": 0})
        reviews.append({"review_body": fake, "label": 1})
    
    # Convert to Dataset
    sample_df = pd.DataFrame(reviews)
    dataset = Dataset.from_pandas(sample_df)

# Display dataset info
dataset
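When sampling a subset from a larger split, row order matters: if the split is stored grouped by label, taking rows from the head yields a single class, while shuffling first keeps both. The toy sketch below uses made-up labels in plain Python (not the real IMDB data) to contrast the two approaches.

```python
import random
from collections import Counter

# Toy stand-in for a split stored grouped by label (all 0s, then all 1s).
labels = [0] * 5000 + [1] * 5000

# Head slice: only the first class is ever seen.
head_sample = labels[:1000]

# Shuffled sample: draw 1,000 distinct indices at random with a fixed seed.
rng = random.Random(42)
shuffled_sample = [labels[i] for i in rng.sample(range(len(labels)), 1000)]

head_counts = Counter(head_sample)
shuffled_counts = Counter(shuffled_sample)
print("Head slice:", dict(head_counts))
print("Shuffled sample:", dict(shuffled_counts))
```

Checking `df['label'].value_counts()` right after loading, as the notebook does in a later cell, is a cheap way to catch this kind of imbalance.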

In [None]:
# Convert to pandas for easier manipulation
df = dataset.to_pandas()
df.head()

In [None]:
# Check class distribution
print("Class distribution:")
print(df['label'].value_counts())

# Preprocess the review text
df['processed_review'] = df['review_body'].apply(preprocess_text)

# Split into train and test sets
train_df, test_df = train_test_split(df, test_size=0.2, stratify=df['label'], random_state=42)

print(f"Training set size: {len(train_df)}")
print(f"Test set size: {len(test_df)}")
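Because the fallback generator assembles reviews from a small pool of phrases, identical texts can land in both splits and inflate test scores. A minimal sketch of an overlap check follows; `find_overlap` is a hypothetical helper, not part of `src`, and the sample lists are made up.

```python
def find_overlap(train_texts, test_texts):
    """Return the sorted list of texts that appear in both splits."""
    return sorted(set(train_texts) & set(test_texts))

# Made-up examples standing in for train_df['processed_review'] and
# test_df['processed_review'].
example_train = ["works well", "good value", "complete scam"]
example_test = ["good value", "fast shipping"]

overlap = find_overlap(example_train, example_test)
print(f"{len(overlap)} text(s) appear in both splits: {overlap}")
```

In the notebook this could be called as `find_overlap(train_df['processed_review'], test_df['processed_review'])`; a non-empty result suggests deduplicating `df` before splitting.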

## 3. Load Pretrained Model

We'll use DistilBERT (`distilbert-base-uncased`), a lightweight pretrained model, with a two-label classification head for this PoC.

In [None]:
# Load pretrained model and tokenizer
model_name = "distilbert-base-uncased"
model, tokenizer = load_pretrained_model(model_name, num_labels=2)

print(f"Loaded model: {model_name}")

## 4. Prepare Dataset for Fine-tuning

In [None]:
# Tokenize the datasets
def tokenize_function(examples):
    return tokenizer(examples, padding="max_length", truncation=True, max_length=128)

# Prepare training dataset
train_texts = train_df['processed_review'].tolist()
train_labels = train_df['label'].tolist()
train_encodings = tokenize_function(train_texts)

# Prepare test dataset
test_texts = test_df['processed_review'].tolist()
test_labels = test_df['label'].tolist()
test_encodings = tokenize_function(test_texts)

# Create PyTorch datasets
class ReviewDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_dataset = ReviewDataset(train_encodings, train_labels)
test_dataset = ReviewDataset(test_encodings, test_labels)
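`padding="max_length"` and `truncation=True` together force every encoded review to exactly `max_length` token IDs. The toy sketch below mimics that behavior with made-up IDs. It is not the DistilBERT tokenizer, which also adds special tokens and splits text into WordPiece units, and `pad_or_truncate` is a hypothetical helper.

```python
PAD_ID = 0  # assumed padding ID for this toy example

def pad_or_truncate(token_ids, max_length):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                    # truncation drops the tail
    n_pad = max_length - len(kept)
    input_ids = kept + [PAD_ID] * n_pad              # padding fills the remainder
    attention_mask = [1] * len(kept) + [0] * n_pad   # 1 = real token, 0 = padding
    return input_ids, attention_mask

short_ids, short_mask = pad_or_truncate([11, 12, 13], max_length=5)
long_ids, long_mask = pad_or_truncate([11, 12, 13, 14, 15, 16, 17], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

With `max_length=128`, any review longer than 128 tokens loses its tail, so it may be worth checking how many reviews exceed that length before settling on the value.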

## 5. Fine-tune the Model

We'll fine-tune the pretrained model on our dataset. This section is split into smaller steps to help isolate any issues.

In [None]:
# Step 5.1: Set up directories
# Use absolute paths to ensure files are saved in the correct location
output_dir = os.path.join(project_root, 'models', 'fake_review_detector')
logging_dir = os.path.join(project_root, 'results', 'logs')

# Create directories if they don't exist
os.makedirs(output_dir, exist_ok=True)
os.makedirs(logging_dir, exist_ok=True)

print(f"Output directory: {output_dir}")
print(f"Logging directory: {logging_dir}")

In [None]:
# Step 5.2: Configure training arguments
# Using parameters compatible with the installed version of Transformers
training_args = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=100,
    weight_decay=0.01,
    logging_dir=logging_dir,
    logging_steps=10,
    # Recent versions of Transformers use eval_strategy; older versions expect evaluation_strategy
    eval_strategy="steps",  # Match with save_strategy
    eval_steps=100,
    save_strategy="steps",  # Explicitly set save strategy
    save_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss"  # Specify which metric to use for best model
)

print("Training arguments configured successfully")
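It helps to translate these step-based settings into counts for this dataset. The arithmetic below assumes the 800-example training split produced by the IMDB path (1,000 rows with `test_size=0.2`), a single device, and no gradient accumulation. All three are assumptions, and `steps_summary` is a hypothetical helper.

```python
import math

def steps_summary(n_train, batch_size, epochs, warmup_steps, eval_steps):
    """Summarize how step-based settings play out for a given dataset size."""
    steps_per_epoch = math.ceil(n_train / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        # Share of training spent in learning-rate warmup.
        "warmup_fraction": min(warmup_steps, total_steps) / total_steps,
        # Number of times a multiple of eval_steps is reached.
        "evaluations": total_steps // eval_steps,
    }

summary = steps_summary(n_train=800, batch_size=16, epochs=3,
                        warmup_steps=100, eval_steps=100)
print(summary)
```

Under these assumptions training runs for 150 steps, the learning rate is still warming up for two thirds of them, and evaluation and checkpointing happen once, at step 100. For the 190-row fallback dataset (152 training rows) the same arithmetic gives about 30 steps, so step 100 is never reached. Smaller `warmup_steps`, `eval_steps`, and `save_steps` may suit a dataset this size.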

In [None]:
# Step 5.3: Define evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return evaluate_model(labels, predictions)

print("Evaluation metrics defined")
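`evaluate_model` is defined in `src/evaluation.py`, which is not shown here. `Trainer` expects `compute_metrics` to return a dictionary mapping metric names to values, so a stand-in might look like the sketch below. `binary_metrics` is a hypothetical pure-Python function that treats label 1 (fake) as the positive class; it is not the project's implementation.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    # Guard every division so empty or one-sided inputs return 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels and predictions for illustration only.
metrics = binary_metrics(y_true=[0, 0, 1, 1, 1], y_pred=[0, 1, 1, 1, 0])
print(metrics)
```

Reporting precision and recall for the fake class alongside accuracy is useful here because a detector that labels everything genuine can still score well on accuracy when fakes are rare.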

In [None]:
# Step 5.4: Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics,
)

print("Trainer initialized successfully")

In [None]:
# Step 5.5: Fine-tune the model
# This step might be memory-intensive and could cause kernel crashes
# Try reducing batch size or model size if it fails
try:
    print("Starting model fine-tuning...")
    trainer.train()
    print("Model fine-tuning completed successfully")
except Exception as e:
    print(f"Error during training: {e}")
    print("\nTroubleshooting tips:")
    print("1. Try reducing batch size (per_device_train_batch_size)")
    print("2. Use an even smaller model if 'distilbert-base-uncased' does not fit in memory")
    print("3. Reduce the maximum sequence length in tokenization")
    print("4. Ensure you have enough memory available")
    raise  # Stop here so later cells do not evaluate an untrained model

## 6. Evaluate the Model

In [None]:
# Evaluate the model on the test set
eval_results = trainer.evaluate()
print("Evaluation results:")
for key, value in eval_results.items():
    print(f"{key}: {value:.4f}")

In [None]:
# Make predictions on the test set
raw_predictions = trainer.predict(test_dataset)
y_pred = np.argmax(raw_predictions.predictions, axis=-1)
y_true = test_labels

# Plot confusion matrix
cm_plot = plot_confusion_matrix(y_true, y_pred, labels=['Genuine', 'Fake'])
cm_plot.show()

## 7. Test with Sample Reviews

Let's test our model with some sample reviews.

In [None]:
# Sample reviews for testing
sample_reviews = [
    "This product is amazing! I've been using it for a month and it has completely changed my life. Highly recommend!",
    "I bought this yesterday and it's already broken. Terrible quality and customer service didn't help at all.",
    "Best purchase ever!!! Five stars!!! Amazing product!!! Buy it now!!!",
    "The product arrived on time and works as expected. Good value for the price."
]

# Preprocess the samples
processed_samples = [preprocess_text(review) for review in sample_reviews]

# Make predictions
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

results = []
for original, processed in zip(sample_reviews, processed_samples):
    pred_class, confidence = predict(model, tokenizer, processed, device)
    results.append({
        "review": original,
        "prediction": "Fake" if pred_class == 1 else "Genuine",
        "confidence": confidence
    })

# Display results
for i, result in enumerate(results):
    print(f"Sample {i+1}:")
    print(f"Review: {result['review']}")
    print(f"Prediction: {result['prediction']} (Confidence: {result['confidence']:.4f})")
    print("---")
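`predict` comes from `src/model.py`, which is not shown here. A common way to turn a classifier's raw logits into a class and a confidence is to apply softmax and take the largest probability; the sketch below assumes that is roughly what `predict` does. `logits_to_prediction` and the logit values are made up for illustration.

```python
import math

def logits_to_prediction(logits):
    """Return (predicted_class, confidence) using a numerically stable softmax."""
    shift = max(logits)                        # subtract the max to avoid overflow
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred_class = max(range(len(probs)), key=probs.__getitem__)
    return pred_class, probs[pred_class]

pred_class, confidence = logits_to_prediction([-1.2, 2.3])  # made-up logits
label = "Fake" if pred_class == 1 else "Genuine"
print(f"{label} (confidence: {confidence:.4f})")
```

A confidence computed this way is the model's own probability estimate and is not necessarily calibrated, so it should be read as a relative score, not as a guarantee.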

## 8. Save the Model for Deployment

In [None]:
# Save the model and tokenizer
model_save_path = os.path.join(project_root, 'models', 'fake_review_detector_final')
os.makedirs(model_save_path, exist_ok=True)

model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)

print(f"Model and tokenizer saved to {model_save_path}")

## 9. Next Steps for Multi-lingual Capabilities

This notebook demonstrates a basic proof of concept for fake review detection. To expand to multi-lingual capabilities, consider these next steps:

1. **Use a multi-lingual model**: Replace DistilBERT with XLM-RoBERTa or mBERT
   ```python
   model_name = "xlm-roberta-base"  # Pretrained on 100 languages
   # or
   model_name = "bert-base-multilingual-cased"  # Supports 104 languages
   ```

2. **Collect multi-lingual training data**: Gather labeled fake/genuine reviews in multiple languages

3. **Evaluate per language**: Set up separate evaluation for each language to ensure consistent performance

4. **Consider language-specific fine-tuning**: Fine-tune on each language separately or use a mixed approach

5. **Implement language detection**: Add a preprocessing step to detect the language of incoming reviews

6. **Deploy as a SageMaker endpoint**: Create a real-time inference endpoint for production use

7. **Implement monitoring**: Set up model monitoring to detect performance drift over time
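As a starting point for step 3, per-language evaluation can be as simple as grouping predictions by a language code before scoring. The sketch assumes each record carries a `lang` field alongside the true and predicted labels; that field, the records, and `accuracy_by_language` are all hypothetical.

```python
from collections import defaultdict

def accuracy_by_language(records):
    """Map each language code to the accuracy over its records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["lang"]] += 1
        correct[rec["lang"]] += int(rec["label"] == rec["pred"])
    return {lang: correct[lang] / total[lang] for lang in total}

# Made-up records for illustration only.
records = [
    {"lang": "en", "label": 1, "pred": 1},
    {"lang": "en", "label": 0, "pred": 0},
    {"lang": "de", "label": 1, "pred": 0},
    {"lang": "de", "label": 0, "pred": 0},
]
per_language = accuracy_by_language(records)
print(per_language)
```

Keeping the scores separate makes it easier to spot a language that lags behind, which a single pooled accuracy figure would hide.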