<a href="https://colab.research.google.com/github/hukgithub/fp_ft_ai_model/blob/main/FuturePath_py.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Step 1: Install Required Libraries**

In [12]:
!pip install transformers datasets sentence-transformers scikit-learn streamlit fsspec==2024.10.0 --quiet

**Step 2: Import Required Libraries**

In [13]:
import pandas as pd
import numpy as np
import torch
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from sentence_transformers import SentenceTransformer
from datasets import Dataset
from sklearn.model_selection import train_test_split

**Step 3: Load and Inspect the Dataset**

In [14]:
# Load the dataset
file_path = "/content/Students_Dataset_1.csv"  # Update the path if needed
students_data = pd.read_csv(file_path)

# Display the column names
print("Columns in the dataset:", students_data.columns)

# Display the first few rows to understand the data
print(students_data.head())

Columns in the dataset: Index(['Name', 'Age', 'Abilities', 'Interest', 'Family Background', 'Culture',
       'Region', 'Country', 'Home Environment', 'Relationship with Family',
       'Thinking', 'Dreams', 'Strengths and Weaknesses'],
      dtype='object')
             Name  Age     Abilities    Interest Family Background  \
0     Priya Ahmed   15  Mathematical  Technology      Joint Family   
1    Rachel Smith   14  Mathematical   Traveling    Nuclear Family   
2    Hannah Singh   17      Artistic         Art    Nuclear Family   
3  Mohammed Singh   18      Artistic       Music    Nuclear Family   
4      Aisha Levi   15       Musical      Sports    Nuclear Family   

       Culture Region   Country Home Environment Relationship with Family  \
0        Mixed  Rural       USA           Strict                    Close   
1  Traditional  Urban  Pakistan          Neutral               Supportive   
2       Modern  Rural    Israel      Encouraging               Conflicted   
3       Mode

**Dataset Validation:**

Add validation to check for missing or malformed data in the features before creating the text column.

In [15]:
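# target_columns is only defined in Step 4, so check every column except Name and Age here
assert students_data.drop(columns=["Name", "Age"]).notnull().all().all(), "Missing values in feature columns."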
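
A broader check can also catch empty or whitespace-only strings, which `notnull()` treats as valid. The sketch below is illustrative only: `validate_features` and the toy frame are hypothetical and not part of the notebook.

```python
import pandas as pd

def validate_features(df, columns):
    """Return a per-column summary of missing or blank values (hypothetical helper)."""
    problems = {}
    for col in columns:
        if col not in df.columns:
            problems[col] = "column not found"
            continue
        missing = int(df[col].isna().sum())
        # Blank strings pass notnull(), so they are counted separately.
        blank = int(df[col].dropna().astype(str).str.strip().eq("").sum())
        if missing or blank:
            problems[col] = {"missing": missing, "blank": blank}
    return problems

# Toy frame standing in for students_data (illustrative values only).
toy = pd.DataFrame({
    "Interest": ["Art", None, "  "],
    "Region": ["Rural", "Urban", "Urban"],
})
report = validate_features(toy, ["Interest", "Region", "Dreams"])
print(report)
```

An empty result means every listed column exists and has no missing or blank entries.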

**Test Dataset:**

Include a separate test set to evaluate model generalization after training.
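
One way to carve out such a test set is a stratified split, so that each class keeps roughly the same share in every subset. This is a sketch on a toy frame with made-up rows; the `stratify` argument is an addition here and is not what the notebook's own split in Step 4 uses.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows, 4 balanced classes (illustrative only).
toy = pd.DataFrame({
    "text": [f"student {i}" for i in range(1000)],
    "field_id": [i % 4 for i in range(1000)],
})

# 70% train, 30% held out, preserving class proportions.
train_df, temp_df = train_test_split(
    toy, test_size=0.3, random_state=42, stratify=toy["field_id"]
)
# Split the held-out 30% evenly into validation and test.
val_df, test_df = train_test_split(
    temp_df, test_size=0.5, random_state=42, stratify=temp_df["field_id"]
)

print(len(train_df), len(val_df), len(test_df))
```

Stratification requires at least two rows per class in the frame being split, so very rare classes may need to be merged or dropped first.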

**Evaluation Metrics:**

Enhance the training pipeline by adding metrics like accuracy, F1-score, precision, and recall.
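
As a sketch of how those metrics can be computed outside the `Trainer` loop, the function below scores raw logits against true labels with scikit-learn and adds a confusion matrix. The toy arrays stand in for model output; in this notebook they could come from something like `trainer.predict(...)` on a tokenized test set, which the notebook itself does not build. `score_logits` is a hypothetical name.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
)

def score_logits(logits, labels):
    """Compute accuracy, weighted precision/recall/F1 and a confusion matrix."""
    preds = np.argmax(logits, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": confusion_matrix(labels, preds),
    }

# Toy logits for 4 samples and 3 classes (illustrative values only).
toy_logits = np.array([
    [2.0, 0.1, 0.3],
    [0.2, 1.5, 0.1],
    [0.3, 0.2, 1.9],
    [1.2, 0.4, 0.8],  # predicted class 0, but the true class is 2
])
toy_labels = np.array([0, 1, 2, 2])

scores = score_logits(toy_logits, toy_labels)
print(scores["accuracy"])
print(scores["confusion_matrix"])
```

Rows of the confusion matrix are true classes and columns are predicted classes, which makes it easier to see which classes get confused than a single weighted score does.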

**Step 4: Prepare the Dataset**

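Define the label and the input text. The label is the student's Interest, encoded as an integer `field_id`. Note that the `text` column built below also contains the Interest value, so the label leaks into the model input; remove `Interest` from the list if the goal is to predict it from the other features.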

In [16]:
# Define target columns as a list
target_columns = [
    "Abilities", "Interest", "Family Background", "Culture", "Region",
    "Country", "Home Environment", "Relationship with Family",
    "Thinking", "Dreams", "Strengths and Weaknesses"
]

# Preprocess dataset: drop rows with missing target values
students_data = students_data.dropna(subset=target_columns)

# Combine all target columns into a single text feature for modeling
students_data["text"] = students_data[target_columns].apply(
    lambda x: " ".join(x.astype(str)), axis=1
)

# Map labels to numeric values based on one target column (e.g., Interest)
students_data["field_id"] = students_data["Interest"].factorize()[0]


# Split the data into 70% training, 15% validation, and 15% test sets

train_data, temp_data = train_test_split(students_data, test_size=0.3, random_state=42)
val_data, test_data = train_test_split(temp_data, test_size=0.5, random_state=42)

# Test set will be used only after training for evaluation
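
Because `text` above concatenates every column in `target_columns`, the `Interest` value that defines `field_id` also appears in the model input. A minimal sketch of building the text from the remaining columns only, on a toy frame; `label_column`, `feature_columns`, and `build_text` are hypothetical names introduced for illustration.

```python
import pandas as pd

def build_text(df, feature_columns):
    """Join the given feature columns into one space-separated string per row."""
    return df[feature_columns].apply(lambda row: " ".join(row.astype(str)), axis=1)

# Toy frame with a subset of the dataset's columns (illustrative values only).
toy = pd.DataFrame({
    "Abilities": ["Mathematical", "Artistic"],
    "Interest": ["Technology", "Art"],
    "Region": ["Rural", "Urban"],
})

label_column = "Interest"
# Every column except the label becomes part of the input text.
feature_columns = [c for c in toy.columns if c != label_column]

toy["text"] = build_text(toy, feature_columns)
toy["field_id"] = toy[label_column].factorize()[0]
print(toy[["text", "field_id"]])
```

With the label excluded from the input, validation scores are likely to differ from those recorded later in this notebook.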



**Step 5: Tokenization**

In [17]:
# Import Dataset class
from datasets import Dataset

# Initialize the tokenizer
tokenizer = DistilBertTokenizer.from_pretrained("distilbert-base-uncased")

# Tokenization function
def tokenize_function(examples):
    return tokenizer(
        examples["text"], padding="max_length", truncation=True, max_length=128
    )

# The 'text' column was already created in Step 4 before the split,
# so train_data and val_data carry it over and it does not need to be rebuilt.

# Convert to Hugging Face dataset format
train_dataset = Dataset.from_pandas(train_data[["text", "field_id"]])
val_dataset = Dataset.from_pandas(val_data[["text", "field_id"]])

# Tokenize datasets
tokenized_train_dataset = train_dataset.map(tokenize_function, batched=True)
tokenized_val_dataset = val_dataset.map(tokenize_function, batched=True)

# Rename 'field_id' to 'labels' for training compatibility
tokenized_train_dataset = tokenized_train_dataset.rename_column("field_id", "labels")
tokenized_val_dataset = tokenized_val_dataset.rename_column("field_id", "labels")

# Remove unnecessary columns
tokenized_train_dataset = tokenized_train_dataset.remove_columns(["text"])
tokenized_val_dataset = tokenized_val_dataset.remove_columns(["text"])

# Set format for PyTorch
tokenized_train_dataset.set_format("torch")
tokenized_val_dataset.set_format("torch")


Map:   0%|          | 0/700 [00:00<?, ? examples/s]

Map:   0%|          | 0/150 [00:00<?, ? examples/s]

**Step 6: Initialize the Model**

In [18]:
# Number of unique classes
num_classes = len(students_data["field_id"].unique())

# Load DistilBERT model
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=num_classes
)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Step 7: Configure Training**

In [21]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Define evaluation metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, predictions, average="weighted")
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc, "f1": f1, "precision": precision, "recall": recall}

# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    logging_dir="./logs",
    report_to="none"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_val_dataset,
    compute_metrics=compute_metrics
)

In [24]:
print(tokenized_train_dataset[0])
print(tokenized_val_dataset[0])

{'labels': tensor(5), '__index_level_0__': tensor(541), 'input_ids': tensor([  101,  4087,  3752,  4517,  2155,  3151,  3923,  2710,  9384,  2485,
        17826,  7155,  3997,  1024,  2524, 21398,  1010, 11251,  1024, 11004,
          102,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,   

**Step 8: Train the Model**

In [25]:
# Start training
trainer.train()

# Save model and tokenizer
model.save_pretrained("./fine_tuned_distilbert")
tokenizer.save_pretrained("./fine_tuned_distilbert")

Epoch,Training Loss,Validation Loss,Accuracy,F1,Precision,Recall
1,No log,0.902336,1.0,1.0,1.0,1.0
2,No log,0.240594,1.0,1.0,1.0,1.0
3,No log,0.158627,1.0,1.0,1.0,1.0


('./fine_tuned_distilbert/tokenizer_config.json',
 './fine_tuned_distilbert/special_tokens_map.json',
 './fine_tuned_distilbert/vocab.txt',
 './fine_tuned_distilbert/added_tokens.json')

**Step 9: Test the Model**

In [26]:
# Load the trained model for inference
trained_model = DistilBertForSequenceClassification.from_pretrained("./fine_tuned_distilbert")
trained_tokenizer = DistilBertTokenizer.from_pretrained("./fine_tuned_distilbert")

# Function for predictions (softmax turns the logits into class probabilities)
import torch.nn.functional as F

def predict(text):
    inputs = trained_tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
    with torch.no_grad():  # Inference only, so no gradients are needed
        outputs = trained_model(**inputs)
    probabilities = F.softmax(outputs.logits, dim=1)  # Apply softmax to logits
    predictions = torch.argmax(outputs.logits, dim=1)
    # Return the predicted class index and the probabilities for all classes
    return predictions.item(), probabilities.tolist()

# Test with a sample
sample_text = "Creative problem-solving strong family support cultural adaptability."
print("Predicted field:", predict(sample_text))

Predicted field: (5, [[0.20099805295467377, 0.11286281794309616, 0.1625814437866211, 0.14467796683311462, 0.13826507329940796, 0.2406146377325058]])
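
The first element of the result is a class index, not an Interest name. `factorize()` returns the unique label values alongside the integer codes, so keeping them allows the index to be mapped back. A sketch on toy data; `interest_labels` and `id_to_label` are hypothetical names, and in the notebook the labels would have to be captured in Step 4, where `field_id` is created.

```python
import pandas as pd

# Toy stand-in for students_data["Interest"] (illustrative values only).
interests = pd.Series(["Technology", "Traveling", "Art", "Music", "Sports", "Art"])

# factorize() returns integer codes plus the unique values in order of first appearance.
codes, interest_labels = interests.factorize()

id_to_label = dict(enumerate(interest_labels))
predicted_id = 2  # stand-in for the index returned by predict()
print(id_to_label[predicted_id])
```

The same dictionary could also be passed as `id2label` when the model is first loaded with `from_pretrained`, so that it is stored in `config.json` alongside the weights.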


**1. Verify the Files Are Saved**

After running your code, check the /content directory in Colab to ensure the fine_tuned_distilbert folder exists:

In [27]:
!ls /content/fine_tuned_distilbert

config.json  model.safetensors	special_tokens_map.json  tokenizer_config.json	vocab.txt


**2. Zip the Folder**

Compress the folder into a .zip file directly within Colab. This creates a file named fine_tuned_distilbert.zip in the /content directory:

In [28]:
!zip -r fine_tuned_distilbert.zip /content/fine_tuned_distilbert

  adding: content/fine_tuned_distilbert/ (stored 0%)
  adding: content/fine_tuned_distilbert/tokenizer_config.json (deflated 75%)
  adding: content/fine_tuned_distilbert/config.json (deflated 52%)
  adding: content/fine_tuned_distilbert/special_tokens_map.json (deflated 42%)
  adding: content/fine_tuned_distilbert/model.safetensors (deflated 8%)
  adding: content/fine_tuned_distilbert/vocab.txt (deflated 53%)
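
Because the `zip` command is given an absolute path, the entries are stored under `content/fine_tuned_distilbert/`, as the output shows. If a flatter archive is preferred, Python's `shutil.make_archive` lets you choose the root. This is a sketch that uses a temporary directory with placeholder files in place of the real model folder.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder standing in for /content/fine_tuned_distilbert.
    model_dir = Path(tmp) / "fine_tuned_distilbert"
    model_dir.mkdir()
    (model_dir / "config.json").write_text("{}")
    (model_dir / "vocab.txt").write_text("[PAD]\n")

    # root_dir is where archiving starts; base_dir is the prefix kept inside the zip.
    archive_path = shutil.make_archive(
        base_name=str(Path(tmp) / "fine_tuned_distilbert"),
        format="zip",
        root_dir=tmp,
        base_dir="fine_tuned_distilbert",
    )

    with zipfile.ZipFile(archive_path) as zf:
        archived_files = sorted(n for n in zf.namelist() if not n.endswith("/"))

print(archived_files)
```

In Colab the equivalent call would use `root_dir="/content"`, so the archive unpacks to a single `fine_tuned_distilbert` folder.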


**3. Download the Zipped File**

Use the following code to download the zipped file from Colab. A download prompt will appear, allowing you to save the fine_tuned_distilbert.zip file to your local machine:

In [29]:
from google.colab import files
files.download("fine_tuned_distilbert.zip")


**4. Upload to Your Deployment Environment**

Once you have the fine_tuned_distilbert.zip file, you can upload it to:

- Your local directory if running Streamlit locally.
- Your GitHub repository if deploying to Streamlit Cloud.
- Hugging Face Spaces if deploying there.
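
On the deployment side the archive has to be unpacked before `from_pretrained` can read it. The sketch below assumes the zip was produced by the `zip -r` command shown earlier, so the model files sit in a nested folder. `extract_model_dir` is a hypothetical helper, and the demo archive contains placeholder files only.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_model_dir(zip_path, dest):
    """Unzip the archive and return the folder that holds config.json."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    matches = sorted(Path(dest).rglob("config.json"))
    if not matches:
        raise FileNotFoundError("config.json not found in archive")
    return matches[0].parent

with tempfile.TemporaryDirectory() as tmp:
    # Build a small stand-in archive with the nested layout from the zip output.
    zip_path = Path(tmp) / "fine_tuned_distilbert.zip"
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("content/fine_tuned_distilbert/config.json", "{}")
        zf.writestr("content/fine_tuned_distilbert/vocab.txt", "[PAD]\n")

    model_dir = extract_model_dir(zip_path, Path(tmp) / "model")
    relative_dir = model_dir.relative_to(Path(tmp) / "model").as_posix()

print(relative_dir)
```

The returned folder can then be passed to `DistilBertForSequenceClassification.from_pretrained(...)` and `DistilBertTokenizer.from_pretrained(...)`, in the same way Step 9 loads `./fine_tuned_distilbert`.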