<a href="https://colab.research.google.com/github/joaowinderfeldbussolotto/deepseek-knowledge-distillation-sentiment-analysis/blob/main/deepseek_model_distillation_sentiment_analysis_dataset_annotation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Financial News Sentiment Analysis - Data Generation Pipeline

## Overview

This notebook implements a data generation pipeline for sentiment analysis of Brazilian Portuguese financial news. It uses the DeepSeek V3 language model (called through the Hyperbolic API) to annotate a dataset that is later used to fine-tune smaller models such as BERT base.

## Pipeline Components

### 1. Dependencies and Setup

In [1]:
!pip install -q -U datasets


In [2]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


## 2. Data Loading and Preprocessing

We'll use a translated version of the Financial Phrase Bank dataset, which contains:
- Financial news texts in Brazilian Portuguese
- Sentiment labels (positive, negative, neutral)
- Human-annotated classifications

In [3]:
!kaggle datasets download mateuspicanco/financial-phrase-bank-portuguese-translation
!unzip financial-phrase-bank-portuguese-translation

Dataset URL: https://www.kaggle.com/datasets/mateuspicanco/financial-phrase-bank-portuguese-translation
License(s): CC-BY-SA-3.0
Downloading financial-phrase-bank-portuguese-translation.zip to /content
  0% 0.00/470k [00:00<?, ?B/s]
100% 470k/470k [00:00<00:00, 31.3MB/s]
Archive:  financial-phrase-bank-portuguese-translation.zip
  inflating: financial_phrase_bank_pt_br.csv  


In [4]:
import pandas as pd

stock_market_data_path = 'financial_phrase_bank_pt_br.csv'

df = pd.read_csv(stock_market_data_path)
df.head()

Unnamed: 0,y,text,text_pt
0,neutral,Technopolis plans to develop in stages an area...,A Technopolis planeja desenvolver em etapas um...
1,negative,The international electronic industry company ...,"A Elcoteq, empresa internacional da indústria ..."
2,positive,With the new production plant the company woul...,Com a nova planta de produção a empresa aument...
3,positive,According to the company 's updated strategy f...,De acordo com a estratégia atualizada da empre...
4,positive,FINANCING OF ASPOCOMP 'S GROWTH Aspocomp is ag...,FINANCIAMENTO DO CRESCIMENTO DA ASPOCOMP A Asp...


In [5]:
df['label_text'] = df['y']

In [6]:
label_counts = df['label_text'].value_counts()
label_percentages = df['label_text'].value_counts(normalize=True) * 100

print("Label Counts:\n", label_counts)
print("\nLabel Percentages:\n", label_percentages)
print("\nTotal Count:", label_counts.sum())

Label Counts:
 label_text
neutral     2878
positive    1363
negative     604
Name: count, dtype: int64

Label Percentages:
 label_text
neutral     59.401445
positive    28.132095
negative    12.466460
Name: proportion, dtype: float64

Total Count: 4845


## 3. Data Splitting Strategy

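Our data splitting approach creates:
- A training set of 3000 samples (1248 neutral, 1248 positive, and the 504 negative samples left after the test hold-out)
- A test set with 100 samples per sentiment class
- Equal class representation in the test set, so evaluation is not dominated by the neutral majority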

Note: The commented code shows an alternative splitting strategy that was considered.
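The split sizes follow from the label counts printed earlier. The sketch below only reproduces that arithmetic, assuming the counts from the previous cell; the names `negative_train` and `per_other_class` are illustrative and do not appear in the notebook.

```python
# Label counts taken from the value_counts() output above.
label_counts = {"neutral": 2878, "positive": 1363, "negative": 604}

TOTAL_TRAIN_TARGET = 3000
TEST_SIZE = 100  # held out per sentiment

# Negative is the scarcest class: everything not held out goes to training.
negative_train = label_counts["negative"] - TEST_SIZE

# The remaining training budget is shared equally by positive and neutral.
per_other_class = (TOTAL_TRAIN_TARGET - negative_train) // 2

# Each class must still have enough rows after removing its test samples.
for name in ("positive", "neutral"):
    assert label_counts[name] - TEST_SIZE >= per_other_class

train_sizes = {
    "neutral": per_other_class,
    "positive": per_other_class,
    "negative": negative_train,
}
print(train_sizes, sum(train_sizes.values()))
```

The positive class is the binding constraint here: only 1263 positive rows remain after the hold-out, which leaves little room to raise the training target much above 3000.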

In [7]:
from sklearn.model_selection import train_test_split

# Constants
TOTAL_TRAIN_TARGET = 3000
TEST_SIZE = 100  # per sentiment

X_train = list()
X_test = list()

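# Negative is the scarcest class: all samples left after the test hold-out go to training
negative_train_size = int((df.label_text == "negative").sum()) - TEST_SIZE  # 604 - 100 = 504

# Divide the remaining training budget equally between positive and neutral
samples_per_other_sentiment = (TOTAL_TRAIN_TARGET - negative_train_size) // 2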

for sentiment in ["positive", "negative", "neutral"]:
    sentiment_data = df[df.label_text == sentiment]

    if sentiment == "negative":
        train_size = len(sentiment_data) - TEST_SIZE
    else:
        train_size = samples_per_other_sentiment

    train, test = train_test_split(sentiment_data,
                                  test_size=TEST_SIZE,
                                  train_size=train_size,
                                  random_state=42)

    X_train.append(train)
    X_test.append(test)

# Combine and shuffle the datasets
X_train = pd.concat(X_train).sample(frac=1, random_state=10)
X_test = pd.concat(X_test)

# Reset index for training data
X_train = X_train.reset_index(drop=True)

# Print the distribution of samples
print("\nDistribution of samples:")
print("Training set distribution:")
print(X_train.label_text.value_counts())
print("\nTest set distribution:")
print(X_test.label_text.value_counts())


Distribution of samples:
Training set distribution:
label_text
neutral     1248
positive    1248
negative     504
Name: count, dtype: int64

Test set distribution:
label_text
positive    100
negative    100
neutral     100
Name: count, dtype: int64
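A quick leakage check can confirm that no sample ends up in both splits. The helper below is a minimal sketch; `find_overlap` is a hypothetical name, and the toy identifiers stand in for whatever uniquely identifies a row.

```python
from typing import Hashable, Iterable, Set


def find_overlap(
    train_ids: Iterable[Hashable], test_ids: Iterable[Hashable]
) -> Set[Hashable]:
    """Return the identifiers present in both splits (an empty set means no leakage)."""
    return set(train_ids) & set(test_ids)


# Toy identifiers standing in for the rows of each split.
toy_train_ids = [0, 1, 2, 5, 8]
toy_test_ids = [3, 4, 9]

print(find_overlap(toy_train_ids, toy_test_ids))
```

In this notebook one option would be `find_overlap(X_train['text_pt'], X_test['text_pt'])`. Because `X_train` has its index reset, the text itself is the safer key, and any overlap found this way would point to duplicated sentences in the source dataset.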


In [8]:
X_train

Unnamed: 0,y,text,text_pt,label_text
0,neutral,AffectoGenimap builds highly customised IT sol...,A AffectoGenimap desenvolve soluções de TI alt...,neutral
1,positive,Tiimari Latvian representative Ineta Zaharova ...,"A representante letã da Tiimari, Ineta Zaharov...",positive
2,negative,Repeats sees 2008 operating profit down y-y ( ...,Repeats vê lucro operacional em 2008 em queda ...,negative
3,positive,India 's trade with Russia currently stands at...,O comércio da Índia com a Rússia atualmente é ...,positive
4,positive,It is now the leading private road ambulance s...,É agora a empresa privada líder em serviços de...,positive
...,...,...,...,...
2995,neutral,"RK Group , headquartered in Vantaa , Finland ,...","O RK Group, com sede em Vantaa, Finlândia, é u...",neutral
2996,positive,"Metrics in QPR ScoreCard now support date , te...",As métricas no QPR ScoreCard agora oferecem su...,positive
2997,negative,Sales at the Tiimari business went down by 8 %...,As vendas no negócio Tiimari caíram 8% para EU...,negative
2998,positive,M-real Corporation Press release on 3 November...,M-real Corporation Comunicado de imprensa em 3...,positive


## 4. Prompt Engineering

The prompt below instructs the model to:
1. Act as a financial analyst
2. Analyze text sentiment from a market perspective
3. Provide reasoning for each classification
4. Output in a structured JSON format

In [8]:
stock_sentiment_cot_prompt = """\
You are a highly qualified financial analyst expert trained to annotate machine learning training data.
Your task is to briefly analyze the sentiment in the NEWS TEXT below from a stock market perspective and then label it with only one of the three labels:
positive (bullish), negative (bearish), neutral.
Base your label decision only on the NEWS TEXT and how it might impact market sentiment or stock prices. Do not speculate beyond the provided information.
You first reason step by step about the correct label and then return your label.
You ALWAYS respond once and exclusively in the following JSON format with brackets: {"reason": "...", "label": "..."}

Examples:
Text: Company XYZ Reports Q4 Earnings Above Expectations, Raises Full-Year Guidance
JSON: {"reason": "The company beat earnings estimates and increased guidance, which typically signals strong financial performance and future growth prospects", "label": "positive"}

Text: Federal Reserve Maintains Current Interest Rates
JSON: {"reason": "The text is a factual statement about monetary policy without indicating a significant change or market impact", "label": "neutral"}

Text: Major Tech Company Announces 15% Workforce Reduction Amid Revenue Decline
JSON: {"reason": "The layoffs and revenue decline indicate business challenges and potential profitability issues, suggesting negative market sentiment", "label": "negative"}

Text: New Partnership Announced Between AI Leader and Cloud Computing Giant
JSON: {"reason": "Strategic partnership between major companies typically suggests growth opportunities and market expansion", "label": "positive"}

Text: Monthly Inflation Rate Remains Unchanged at 3.1%
JSON: {"reason": "The stable inflation rate without significant change doesn't indicate a clear market direction", "label": "neutral"}

Text: Global Supply Chain Disruptions Worsen, Leading to Production Delays
JSON: {"reason": "Supply chain issues and production delays typically impact company operations and revenues negatively", "label": "negative"}

Text: SEC Begins Standard Review of Bitcoin ETF Applications
JSON: {"reason": "A routine regulatory review without clear approval or rejection doesn't suggest a definitive market impact", "label": "neutral"}

Text: Electric Vehicle Maker Reports Record Quarterly Deliveries
JSON: {"reason": "Record deliveries indicate strong demand and operational execution, positive for company growth", "label": "positive"}

Text: Credit Rating Agency Downgrades Major Bank's Debt
JSON: {"reason": "Credit downgrade suggests increased financial risk and potentially higher borrowing costs", "label": "negative"}
"""

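Because the prompt asks for a JSON object with a `reason` and a `label`, replies can be checked before they are stored. The sketch below is one possible tolerant parser and not the notebook's own code: `parse_annotation` and `ALLOWED_LABELS` are hypothetical names, and the handling of Markdown fences is an assumption about how a model might wrap its reply.

```python
import json
import re
from typing import Dict

ALLOWED_LABELS = {"positive", "negative", "neutral"}


def parse_annotation(raw: str) -> Dict[str, str]:
    """Extract and validate the {"reason": ..., "label": ...} object from a model reply."""
    # Keep only the outermost {...} span, which drops code fences or stray text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in reply: {raw!r}")
    candidate = match.group(0)

    # Some replies wrap the object in doubled braces; unwrap one level.
    if candidate.startswith("{{") and candidate.endswith("}}"):
        candidate = candidate[1:-1]

    parsed = json.loads(candidate)
    label = str(parsed.get("label", "")).strip().lower()
    if label not in ALLOWED_LABELS:
        raise ValueError(f"Unexpected label: {label!r}")
    return {"reason": str(parsed.get("reason", "")), "label": label}


reply = '```json\n{"reason": "Profit rose year over year", "label": "Positive"}\n```'
print(parse_annotation(reply))
```

Rejecting labels outside the three allowed values keeps variants such as "bullish" out of the annotated file, where they would otherwise count as disagreements with the human labels.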
## 5. Annotation Pipeline

Implementing a robust annotation system with:
- Retry logic for API failures
- Progress tracking
- Comprehensive error handling
- Structured output saving

In [12]:
import requests
import json
import time
import logging
from typing import Dict, Any
from google.colab import userdata

class ModelInvocationError(Exception):
    pass

def invoke_deep_seek_model_with_retry(
    text: str,
    max_retries: int = 3,
    retry_delay: float = 10
) -> Dict[str, Any]:
    """
    Invokes the DeepSeek API with retry logic and consistent delays.

    Args:
        text: Input text for sentiment analysis
        max_retries: Maximum number of retry attempts
        retry_delay: Delay in seconds when retrying after an error

    Returns:
        Dict containing the model's response

    Raises:
        ModelInvocationError: If all retry attempts fail
    """
    url = "https://api.hyperbolic.xyz/v1/chat/completions"

    auth = f"Bearer {userdata.get('HYPERBOLIC_API_KEY')}"
    headers = {
        "Content-Type": "application/json",
        "Authorization": auth
    }
    data = {
        "messages": [
             {
                "role": "system",
                "content": stock_sentiment_cot_prompt
            },
            {
                "role": "user",
                "content": text
            }
        ],
        "model": "deepseek-ai/DeepSeek-V3",
        "max_tokens": 256,
        "temperature": 0.4,
        "top_p": 0.9
    }

    attempt = 0

    while attempt <= max_retries:
        try:
            # Bound the wait so a stalled connection raises and enters the retry path
            response = requests.post(url, headers=headers, json=data, timeout=60)

            # Check for HTTP errors
            response.raise_for_status()

            result = response.json()

            content = result.get('choices')[0].get('message').get('content')
            # If needed, strip any extra curly braces
            if content.startswith('{{') and content.endswith('}}'):
                content = content[1:-1]

            return json.loads(content)


        except Exception as e:
            attempt += 1
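            if attempt > max_retries:
                # `attempt` now equals the number of calls made (1 initial + max_retries retries)
                raise ModelInvocationError(f"Failed after {attempt} attempts: {str(e)}") from e

            # Malformed JSON from the model is not a rate-limit problem, so retry quickly
            if isinstance(e, json.JSONDecodeError):
                delay = 1
            else:
                delay = retry_delay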

            logging.warning(
                f"Attempt {attempt}/{max_retries} failed: {str(e)}. "
                f"Retrying in {delay} seconds..."
            )
            time.sleep(delay)

    raise ModelInvocationError("Unexpected error: Reached end of retry loop")
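The fixed-delay retry pattern used in `invoke_deep_seek_model_with_retry` can be exercised without network access. The sketch below is a simplified, hypothetical helper (`call_with_retry` and `flaky` are not part of the notebook) that takes the sleep function as a parameter so the behaviour can be checked instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_retries: int = 3,
    retry_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn(), retrying up to max_retries times with a fixed delay between attempts."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            # Out of retries: let the last error propagate to the caller.
            if attempt == max_retries:
                raise
            sleep(retry_delay)
    raise RuntimeError("unreachable")  # keeps type checkers satisfied


# Fake call that fails twice and then succeeds, with sleeping disabled.
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retry(flaky, max_retries=3, sleep=lambda seconds: None))
```

Injecting `sleep` is a design choice for testability only; in the notebook the real `time.sleep` spaces out requests after an API error.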



In [13]:
import json
import csv
from time import sleep
from datetime import datetime
from typing import Dict, Any
import logging
from tqdm import tqdm

logging.basicConfig(level=logging.INFO,
                   format='%(asctime)s - %(levelname)s - %(message)s')
logger = logging.getLogger(__name__)


def annotate_training_data(
    X_train,
    output_file: str = 'annotated_deep_seek_train.csv',
    max_retries: int = 5
) -> None:
    """
    Annotates training data with DeepSeek V3 sentiment labels and saves the result to CSV.

    Args:
        X_train: DataFrame containing 'text_pt' and 'label_text' columns
        output_file: Path to save the annotated CSV file
        max_retries: Maximum number of retry attempts for model invocation
    """
    timestamp = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
    logger.info(f"Starting annotation process at {timestamp}")

    fieldnames = [
        'text',
        'label_text',
        'deep_seek_v3_reason',
        'deep_seek_v3_label_text',
        'deep_seek_v3_label'
    ]

    processed_count = 0
    error_count = 0

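    # Intended to silence console logging while tqdm is active. Note that this
    # handler was never attached to `logger`, so removeHandler() is a no-op here.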
    logging_handler = logging.StreamHandler()
    logger.removeHandler(logging_handler)

    with open(output_file, 'w', newline='', encoding='utf-8') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=fieldnames)
        writer.writeheader()

        # Create progress bar with dynamic_ncols=True
        pbar = tqdm(total=len(X_train), desc="Processing samples",
                   unit="sample", dynamic_ncols=True,
                   position=0, leave=True)

        for idx, row in X_train.iterrows():
            try:
                # Invoke model with retry logic
                result = invoke_deep_seek_model_with_retry(
                    row['text_pt'],
                    max_retries=max_retries
                )

                # Prepare row data
                row_data = {
                    'text': row['text_pt'],
                    'label_text': row['label_text'],
                    'deep_seek_v3_reason': result['reason'],
                    'deep_seek_v3_label_text': result['label'],
                    # Binary flag: 1 for positive, 0 for both negative and neutral
                    'deep_seek_v3_label': 1 if result['label'] == 'positive' else 0
                }

                # Write row and flush immediately
                writer.writerow(row_data)
                outfile.flush()

                processed_count += 1

                # Update progress bar using set_postfix for additional info
                pbar.set_postfix({
                    'Last': f"{row['label_text']}→{result['label']}",
                    'Success': f"{processed_count}/{len(X_train)}"
                })
                pbar.update(1)

            except ModelInvocationError as e:
                error_count += 1
                # Write error to progress bar instead of logging
                pbar.write(f"Error processing sample {idx}: {str(e)}")
                pbar.update(1)
                continue

        # Close progress bar
        pbar.close()

    # Re-enable logging
    logger.addHandler(logging_handler)

    # Log final statistics
    completion_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
    logger.info(f"\nAnnotation completed at {completion_time}")
    logger.info(f"Annotated data saved to: {output_file}")
    logger.info(f"Total samples processed: {processed_count}")
    logger.info(f"Total errors: {error_count}")

    if error_count > 0:
        logger.warning(
            f"Completed with {error_count} errors. "
            f"Success rate: {(processed_count/(processed_count+error_count))*100:.1f}%"
        )

In [17]:
annotate_training_data(X_train, "annotated_deep_seek_train.csv")
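Since each row is flushed to disk as soon as it is written, an interrupted run leaves a usable partial file. The sketch below shows one way to find out which texts are already annotated so a later run could skip them. It is an illustration under assumptions: `load_done_texts` is a hypothetical helper, and `annotate_training_data` as written opens the file in `'w'` mode and would overwrite it.

```python
import csv
import os
import tempfile
from typing import Set


def load_done_texts(path: str) -> Set[str]:
    """Return the 'text' values already present in a partially written annotation CSV."""
    if not os.path.exists(path):
        return set()
    with open(path, newline="", encoding="utf-8") as infile:
        return {row["text"] for row in csv.DictReader(infile)}


# Demo with a temporary file standing in for a partially written output file.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "partial.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as outfile:
        writer = csv.DictWriter(outfile, fieldnames=["text", "label_text"])
        writer.writeheader()
        writer.writerow({"text": "Lucro sobe 10%", "label_text": "positive"})

    done = load_done_texts(demo_path)
    pending = [t for t in ["Lucro sobe 10%", "Vendas caem 8%"] if t not in done]

print(pending)
```

To resume in the notebook, the rows could be filtered with `X_train[~X_train['text_pt'].isin(done)]` and the output file opened in append mode without rewriting the header.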

## 6. Results Analysis

Analyzing the quality of LLM annotations:
- Comparison with original labels
- Accuracy metrics
- Distribution of predictions

In [23]:
annotated_df = pd.read_csv('annotated_deep_seek_train.csv')
annotated_df.head()

Unnamed: 0,text,label_text,deep_seek_v3_reason,deep_seek_v3_label_text,deep_seek_v3_label
0,A AffectoGenimap desenvolve soluções de TI alt...,neutral,The text describes the company's focus on prov...,neutral,0
1,"A representante letã da Tiimari, Ineta Zaharov...",positive,The company's profit increased significantly b...,positive,1
2,Repeats vê lucro operacional em 2008 em queda ...,negative,The text indicates a decline in operating prof...,negative,0
3,O comércio da Índia com a Rússia atualmente é ...,positive,The text highlights a growth in trade between ...,neutral,0
4,É agora a empresa privada líder em serviços de...,positive,The text highlights the company's leading posi...,positive,1


In [25]:
import pandas as pd
from sklearn.metrics import accuracy_score

# Calculate accuracy
accuracy = accuracy_score(annotated_df['label_text'], annotated_df['deep_seek_v3_label_text'])

print(f"Accuracy of deep-seek on training data: {accuracy:.2%}")

Accuracy of deep-seek on training data: 80.25%
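A single accuracy figure hides which classes the model and the human annotators disagree on. The sketch below counts label pairs and per-class agreement using only the standard library; the function names are hypothetical and the toy labels are made up for illustration, not taken from the annotated file.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_counts(human: List[str], model: List[str]) -> Dict[Tuple[str, str], int]:
    """Count how often each (human label, model label) pair occurs."""
    return dict(Counter(zip(human, model)))


def per_class_agreement(human: List[str], model: List[str]) -> Dict[str, float]:
    """For each human label, the share of rows where the model gave the same label."""
    totals = Counter(human)
    matches = Counter(h for h, m in zip(human, model) if h == m)
    return {label: matches[label] / total for label, total in totals.items()}


# Toy labels; in the notebook these would be the two label columns of annotated_df.
human_labels = ["positive", "positive", "neutral", "negative", "neutral"]
model_labels = ["positive", "neutral", "neutral", "negative", "neutral"]

print(confusion_counts(human_labels, model_labels))
print(per_class_agreement(human_labels, model_labels))
```

With the real data, the inputs would be `annotated_df['label_text'].tolist()` and `annotated_df['deep_seek_v3_label_text'].tolist()`. Row 3 in the preview above is an example of a positive human label that the model marked neutral.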


In [26]:
source_file = "annotated_deep_seek_train.csv"

# Define the destination file path on Google Drive
destination_file = "/content/drive/MyDrive/projects/stock-market-sentiment/deepseek/v2/train.csv"

!mkdir -p /content/drive/MyDrive/projects/stock-market-sentiment/deepseek/v2/
!cp $source_file $destination_file

print(f"File copied to: {destination_file}")

File copied to: /content/drive/MyDrive/projects/stock-market-sentiment/deepseek/v2/train.csv


In [27]:
ANNOTATED_TEST_FILE_PATH = 'annotated_test.csv'
annotate_training_data(X_test, ANNOTATED_TEST_FILE_PATH)

In [28]:
destination_test_file = f"/content/drive/MyDrive/projects/stock-market-sentiment/deepseek/v2/{ANNOTATED_TEST_FILE_PATH}"

!mkdir -p /content/drive/MyDrive/projects/stock-market-sentiment/deepseek/v2/
!cp $ANNOTATED_TEST_FILE_PATH $destination_test_file

print(f"File copied to: {destination_test_file}")


File copied to: /content/drive/MyDrive/projects/stock-market-sentiment/deepseek/v2/annotated_test.csv


## 7. Model Training Details

### Training Process
We used HuggingFace AutoTrain Advanced for model training, leveraging Google Colab's GPU capabilities. The process involved:

- **Teacher Model**: DeepSeek V3 LLM for dataset annotation
- **Student Model**: BERT base for Portuguese (neuralmind/bert-base-portuguese-cased)
- **Training Platform**: Google Colab with AutoTrain Advanced
- **Final Model**: Available at `winderfeld/bert-portuguese-deepseek-sentiment-analysis`

### Training Configuration
```json
{
    "auto_find_batch_size": "false",
    "eval_strategy": "epoch",
    "mixed_precision": "fp16",
    "optimizer": "adamw_torch",
    "scheduler": "linear",
    "batch_size": "16",
    "early_stopping_patience": "5",
    "early_stopping_threshold": "0.01",
    "epochs": "5",
    "gradient_accumulation": "1",
    "lr": "0.00005",
    "logging_steps": "-1",
    "max_grad_norm": "1",
    "max_seq_length": "128",
    "save_total_limit": "1",
    "seed": "42",
    "warmup_ratio": "0.1",
    "weight_decay": "0"
}
```
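To make the configuration more concrete, the sketch below estimates how many optimizer steps these settings imply. It assumes that all 3000 annotated training rows are used for training; rows lost to annotation errors or held out for validation would lower the numbers, and the exact rounding used by the training library may differ.

```python
import math

# Values from the training configuration above.
batch_size = 16
epochs = 5
gradient_accumulation = 1
warmup_ratio = 0.1

# Assumption: every one of the 3000 annotated training rows is used for training.
num_train_samples = 3000

batches_per_epoch = math.ceil(num_train_samples / batch_size)
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation)
total_steps = steps_per_epoch * epochs

# With a linear scheduler, the learning rate ramps up during the warmup steps
# and then decays linearly for the remaining steps.
warmup_steps = round(total_steps * warmup_ratio)

print(steps_per_epoch, total_steps, warmup_steps)
```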