# 🧩 Generating the Datasets for the SBERT + ContrastiveLoss Tutorial

This notebook generates the standardized datasets for the tutorial:

- `train.csv` → used to train the model
- `test.csv` → given to the students (no labels)
- `ground_truth.csv` → contains the id and the correct answers (the true scores, stored in the `predictions` column)
- `sample_submission.csv` → submission template (id + predictions)

Standardized format:
```
train.csv          → sentence1, sentence2, similarity_score
test.csv           → id, sentence1, sentence2
ground_truth.csv   → id, predictions
sample_submission.csv → id, predictions
```

## ⚙️ 1. Imports and setup

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
import os

## 📂 2. Load the original dataset

The file `train_original.csv` must contain the following columns:
- `sentence1`
- `sentence2`
- `similarity_score`
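
The column requirement above can be checked before any further processing. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns` and the toy frame are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Columns the notebook expects in train_original.csv.
REQUIRED_COLUMNS = ["sentence1", "sentence2", "similarity_score"]

def missing_columns(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for the real file (illustrative data only).
toy = pd.DataFrame({"sentence1": ["a"], "sentence2": ["b"]})
print(missing_columns(toy))  # ['similarity_score']
```

An empty list means the file has the expected schema; a non-empty one names the columns to fix before running the remaining cells.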

In [2]:
DATA_DIR = 'data'
os.makedirs(DATA_DIR, exist_ok=True)

original_path = os.path.join(DATA_DIR, 'train_original.csv')
df = pd.read_csv(original_path)
print(f"✅ Original dataset loaded with {len(df)} samples.")
df.head()

✅ Original dataset loaded with 6040 samples.


| | sentence1 | sentence2 | similarity_score |
|---|---|---|---|
| 0 | The cat is attacking a corn husk broom. | Grey and white cat sitting in bathroom sink. | 0.8 |
| 1 | It depends on what you want to do next, and wh... | It's up to you what you want to do next. | 4.0 |
| 2 | A spokeswoman at Strong Memorial Hospital said... | A spokesman at Strong Memorial Hospital said D... | 3.0 |
| 3 | Nepal earthquake death toll surpasses 7,000 | Death toll in Nepal earthquake tops 8,000 | 3.0 |
| 4 | The man is pushing the van. | The woman is singing. | 0.0 |


## ✂️ 3. Shuffle and split into train and test

In [3]:
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
print(f"Train: {len(df_train)} | Test: {len(df_test)}")

Train: 4832 | Test: 1208
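
As a sanity check on the split, one can confirm that the two parts are disjoint and together cover every row. This is a minimal sketch on a synthetic 10-row frame (the data and the `toy_*` names are assumptions made for illustration, not the tutorial's dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (10 rows, illustrative only).
toy = pd.DataFrame({
    "sentence1": [f"s{i}" for i in range(10)],
    "sentence2": [f"t{i}" for i in range(10)],
    "similarity_score": [float(i % 6) for i in range(10)],
})

# Same call shape as in the notebook: 20% held out, fixed seed.
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# No row may appear in both parts, and together they must cover every row.
overlap = set(toy_train.index) & set(toy_test.index)
covered = set(toy_train.index) | set(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap), covered == set(toy.index))
```

Because `train_test_split` shuffles by default, fixing `random_state` is what makes the split reproducible across runs.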


## 🧮 4. Add IDs and build the standardized files

In [4]:
# Add a unique ID to each test sample
df_test = df_test.reset_index(drop=True)
df_test['id'] = df_test.index

# GROUND TRUTH (id + true scores, stored under the name 'predictions')
df_ground_truth = df_test[['id', 'similarity_score']].rename(columns={'similarity_score': 'predictions'})

# TEST (no labels)
df_test_public = df_test[['id', 'sentence1', 'sentence2']]

# SAMPLE SUBMISSION (default structure for students)
sample_submission = pd.DataFrame({
    'id': df_test['id'],
    'predictions': [0.0] * len(df_test)
})
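
The three frames built above should share the same ids, and the public test frame must not carry the label. A minimal sketch of those checks on toy data follows; the frame names `truth`, `public`, and `sample` are placeholders chosen for this example, and the rows are illustrative.

```python
import pandas as pd

# Toy test split (illustrative rows only).
toy_test = pd.DataFrame({
    "sentence1": ["s0", "s1"],
    "sentence2": ["t0", "t1"],
    "similarity_score": [4.0, 0.0],
}).reset_index(drop=True)
toy_test["id"] = toy_test.index

# Derive the three frames the same way the notebook cell does.
truth = toy_test[["id", "similarity_score"]].rename(columns={"similarity_score": "predictions"})
public = toy_test[["id", "sentence1", "sentence2"]]
sample = pd.DataFrame({"id": toy_test["id"], "predictions": [0.0] * len(toy_test)})

# The label must not leak into the public file, and ids must line up everywhere.
assert "similarity_score" not in public.columns
assert public["id"].is_unique
assert public["id"].tolist() == truth["id"].tolist() == sample["id"].tolist()
```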

## 💾 5. Save the final files

In [5]:
train_path = os.path.join(DATA_DIR, 'train.csv')
test_path = os.path.join(DATA_DIR, 'test.csv')
truth_path = os.path.join(DATA_DIR, 'ground_truth.csv')
sample_path = os.path.join(DATA_DIR, 'sample_submission.csv')

# Save each file without the DataFrame index
df_train.to_csv(train_path, index=False)
df_test_public.to_csv(test_path, index=False)
df_ground_truth.to_csv(truth_path, index=False)
sample_submission.to_csv(sample_path, index=False)

print('✅ Files created successfully:')
print('-', train_path)
print('-', test_path)
print('-', truth_path)
print('-', sample_path)

✅ Files created successfully:
- data/train.csv
- data/test.csv
- data/ground_truth.csv
- data/sample_submission.csv
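
The `index=False` argument matters when the files are read back: without it, pandas writes the index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. A minimal sketch using an in-memory buffer instead of real files (the helper `roundtrip_columns` is hypothetical):

```python
import io
import pandas as pd

toy = pd.DataFrame({"id": [0, 1], "predictions": [0.0, 0.0]})

def roundtrip_columns(df: pd.DataFrame, index: bool) -> list:
    """Write df to CSV in memory, read it back, and return the column names."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=index)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

print(roundtrip_columns(toy, index=False))  # ['id', 'predictions']
print(roundtrip_columns(toy, index=True))   # ['Unnamed: 0', 'id', 'predictions']
```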


## 🔍 6. Inspect samples

In [6]:
print('Train:')
display(df_train.head(3))
print('\nTest:')
display(df_test_public.head(3))
print('\nGround Truth:')
display(df_ground_truth.head(3))
print('\nSample Submission:')
display(sample_submission.head(3))

Train:

| | sentence1 | sentence2 | similarity_score |
|---|---|---|---|
| 1121 | A woman puts make-up on. | A woman is putting on eyeshadow. | 3.333 |
| 4431 | Scientists prove there is water on Mars | Has Nasa discovered water on Mars? | 2.0 |
| 4060 | Ban Ki-moon to Review Syria Chemical Arms Accord | Ban to review Syria chemical arms accord | 3.6 |

Test:

| | id | sentence1 | sentence2 |
|---|---|---|---|
| 0 | 0 | A man is singing and playing a guitar. | A man is playing a guitar. |
| 1 | 1 | Suarez set for Cup comeback | French set for Mali ground combat |
| 2 | 2 | It's not a good idea. | It's a good question. |

Ground Truth:

| | id | predictions |
|---|---|---|
| 0 | 0 | 3.6 |
| 1 | 1 | 0.8 |
| 2 | 2 | 0.0 |

Sample Submission:

| | id | predictions |
|---|---|---|
| 0 | 0 | 0.0 |
| 1 | 1 | 0.0 |
| 2 | 2 | 0.0 |


## ✅ Conclusion

The datasets were generated with the following structure:

```
train.csv          → supervised training data
test.csv           → ids and sentence pairs (no labels)
ground_truth.csv   → ids and correct answers
sample_submission.csv → structure the students must produce
```
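
For context, a submission in this format could be scored against `ground_truth.csv` by joining on `id`. The sketch below assumes Pearson correlation as the metric, which is a common choice for sentence-similarity tasks; the tutorial's actual metric is not stated in this notebook, and `score_submission` is a hypothetical helper. The toy scores are illustrative.

```python
import pandas as pd

def score_submission(submission: pd.DataFrame, ground_truth: pd.DataFrame) -> float:
    """Pearson correlation between true and predicted scores (assumed metric)."""
    # Join on id so that row order in the submission does not matter.
    merged = ground_truth.merge(
        submission, on="id", how="left", suffixes=("_true", "_pred")
    )
    if merged["predictions_pred"].isna().any():
        raise ValueError("submission is missing predictions for some ids")
    # Series.corr uses Pearson correlation by default.
    return merged["predictions_true"].corr(merged["predictions_pred"])

# Toy ground truth (illustrative values only).
truth = pd.DataFrame({"id": [0, 1, 2, 3], "predictions": [3.6, 0.8, 0.0, 5.0]})
perfect = truth.copy()
print(round(score_submission(perfect, truth), 6))  # 1.0
```

Note that a constant submission, such as the all-zero `sample_submission.csv`, has zero variance, so its Pearson correlation is undefined and `corr` returns NaN.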