Extract Data from MongoDB:
Use a MongoDB client (such as pymongo) to query and retrieve data from your MongoDB collections, then convert the retrieved documents into a format suitable for training, such as a pandas DataFrame.
Preprocess and Train:
Preprocess the data (e.g., tokenization, label encoding) using the Hugging Face transformers and datasets libraries, then train the model. Once trained, you can push the model to the Hugging Face Hub or deploy it in your environment.
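The extraction step can be sketched as follows. The connection string and the database/collection names (`mydb`, `examples`) are hypothetical placeholders; the pymongo call is shown commented out so the DataFrame conversion itself runs without a live server:

```python
import pandas as pd

# With a live MongoDB server you would do something like:
#   from pymongo import MongoClient
#   docs = list(MongoClient("mongodb://localhost:27017")["mydb"]["examples"]
#               .find({}, {"_id": 0, "text": 1, "label": 1}))
# Stubbed cursor results, standing in for the query above:
docs = [
    {"text": "This is a positive example.", "label": "positive"},
    {"text": "This is a negative example.", "label": "negative"},
]

# Convert the retrieved documents into a pandas DataFrame for training
df = pd.DataFrame(docs)
print(df.shape)
```

The projection in the `find` call drops MongoDB's `_id` field so only the training-relevant columns reach the DataFrame.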
First, install the required packages:
pip install -r requirements.txt
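The contents of requirements.txt are not shown here; based on the imports in the training script below, a minimal version would look something like this (versions unpinned, adjust as needed for your environment):

```
pandas
scikit-learn
datasets
transformers
torch
huggingface_hub
```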
In your project folder, create a JSON data file (name it what you wish; this guide uses simple_test_data.json) and add some JSON data:
[
  {
    "text": "This is a positive example.",
    "label": "positive"
  },
  {
    "text": "This is a negative example.",
    "label": "negative"
  },
  {
    "text": "This is another positive example.",
    "label": "positive"
  },
  {
    "text": "This is another negative example.",
    "label": "negative"
  }
]
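Before training, it is worth a quick sanity check that the file parses and has the expected columns. This sketch writes the same records to simple_test_data.json and reads them back the same way the training script does:

```python
import json
import pandas as pd

# A small stand-in for the data file above
records = [
    {"text": "This is a positive example.", "label": "positive"},
    {"text": "This is a negative example.", "label": "negative"},
]
with open("simple_test_data.json", "w") as f:
    json.dump(records, f, indent=2)

# This mirrors what hugfaceAuto.py does on startup
df = pd.read_json("simple_test_data.json")
print(df["label"].value_counts())
```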
Then run the hugfaceAuto.py training script. Make sure you are in the same folder as simple_test_data.json, or pass the path to it:
import pandas as pd
from sklearn.model_selection import train_test_split
from datasets import Dataset
from transformers import BertTokenizer, BertForSequenceClassification, TrainingArguments, Trainer
from huggingface_hub import login
import torch

# Authenticate with the Hugging Face Hub (needed for push_to_hub below)
login(token="put your WRITE token here")  # Replace with your actual Hugging Face write token
# Load the JSON data and encode the string labels as integers
df = pd.read_json('simple_test_data.json')
label_map = {'positive': 1, 'negative': 0}
df['label'] = df['label'].map(label_map)

# Hold out 20% of the rows for evaluation and wrap both splits as Datasets
train_df, test_df = train_test_split(df, test_size=0.2)
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
# Load the pretrained BERT checkpoint with a fresh 2-class classification head
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Tokenize the text and keep only the columns the Trainer needs, as tensors
def preprocess_function(examples):
    return tokenizer(examples['text'], truncation=True, padding=True, max_length=128)

train_dataset = train_dataset.map(preprocess_function, batched=True)
test_dataset = test_dataset.map(preprocess_function, batched=True)
train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])
# Training hyperparameters: evaluate once per epoch
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

# Fine-tune, evaluate, and save the model and tokenizer locally
trainer.train()
trainer.evaluate()
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")

# Push the fine-tuned model and tokenizer to the Hugging Face Hub
model.push_to_hub("your_model_name")  # Replace with your desired model name on Hugging Face
tokenizer.push_to_hub("your_model_name")
# Reload the saved model and run a quick prediction on new text
model = BertForSequenceClassification.from_pretrained("./saved_model")
tokenizer = BertTokenizer.from_pretrained("./saved_model")

text = "This is a new example for prediction."
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_label = torch.argmax(predictions, dim=1).item()
print(f"Predicted label: {predicted_label}")
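The script prints the numeric class index. To map it back to the original string label, invert the label_map defined at the top of the script:

```python
# Same mapping used during training, inverted for readable output
label_map = {'positive': 1, 'negative': 0}
id2label = {v: k for k, v in label_map.items()}

predicted_label = 0  # e.g. the value printed by the script
print(f"Predicted label: {id2label[predicted_label]}")  # → Predicted label: negative
```

You can also bake this mapping into the model itself by passing `id2label` and `label2id` to `from_pretrained`, so downstream tools display string labels instead of indices.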
A sample run looks like this:
➜ huggingface python3 hugfaceAuto.py
Token is valid (permission: write).
Your token has been saved to /Users/jeffery.schmitz/.cache/huggingface/token
Login successful
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 567.13 examples/s]
Map: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 460.00 examples/s]
{'eval_loss': 0.6930631399154663, 'eval_runtime': 0.1085, 'eval_samples_per_second': 9.217, 'eval_steps_per_second': 9.217, 'epoch': 1.0}
{'eval_loss': 0.7359107732772827, 'eval_runtime': 0.0149, 'eval_samples_per_second': 67.151, 'eval_steps_per_second': 67.151, 'epoch': 2.0}
{'eval_loss': 0.7567771673202515, 'eval_runtime': 0.0144, 'eval_samples_per_second': 69.668, 'eval_steps_per_second': 69.668, 'epoch': 3.0}
{'train_runtime': 1.4217, 'train_samples_per_second': 6.33, 'train_steps_per_second': 2.11, 'train_loss': 0.735368569691976, 'epoch': 3.0}
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00, 2.12it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 624.62it/s]
model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 438M/438M [05:20<00:00, 1.37MB/s]
Predicted label: 0
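Once the model is pushed, it can be loaded anywhere by repo id. A minimal inference sketch using the transformers pipeline API; the repo id below is a placeholder for whatever name you pushed under, and the calls are shown commented out since they require the trained model to exist:

```python
from transformers import pipeline

# Load by Hub repo id (replace with your actual username/model name):
#   clf = pipeline("text-classification", model="your-username/your_model_name")
#   print(clf("This is a new example for prediction."))
#
# Or load from the local directory saved by the training script:
#   clf = pipeline("text-classification", model="./saved_model")
```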
