# Set up

In [None]:
!pip install -q scikit-learn datasets transformers[torch] evaluate accelerate


In [None]:
import numpy as np
import re
import sklearn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, accuracy_score
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
import pandas as pd
# `Dataset` must come from `datasets` (it provides `from_pandas`), not from `torch.utils.data`
from datasets import Dataset, load_dataset
import torch
import time
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments, AutoTokenizer, DataCollatorWithPadding, AutoConfig, XLNetTokenizer, XLNetModel
import transformers
from torch.utils.data import DataLoader
from transformers import RobertaModel, RobertaTokenizer
from tqdm.auto import tqdm
from torch import cuda
from accelerate import Accelerator, DataLoaderConfiguration
import evaluate  # needed for the evaluate.load(...) calls in the metric functions
from evaluate import evaluator



In [None]:
device = 'cuda' if cuda.is_available() else 'cpu'

# Functions


In [None]:
def preprocess_function(examples):
    # Tokenize the text; padding is left to DataCollatorWithPadding, which pads each batch dynamically
    tokenized_inputs = tokenizer(examples['text'], truncation=True, max_length=512)

    # Trainer expects integer class ids under the 'labels' key
    tokenized_inputs['labels'] = [int(label) for label in examples['label']]

    return tokenized_inputs

In [None]:
def compute_metrics(eval_pred):
    # Load the metrics with the `evaluate` library (`datasets.load_metric` is deprecated)
    import evaluate
    accuracy_metric = evaluate.load("accuracy")
    f1_metric = evaluate.load("f1")

    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    # Calculate each metric from the predicted and true labels, returning plain floats
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    f1 = f1_metric.compute(predictions=predictions, references=labels, average="weighted")["f1"]
    return {"accuracy": accuracy, "f1score": f1}

In [None]:
def compute_metrics1(eval_pred):
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric= evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    averaging_method = "macro"

    # Compute the metrics
    precision = precision_metric.compute(predictions=predictions, references=labels, average=averaging_method)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels, average=averaging_method)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels, average=averaging_method)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer is expecting a dictionary where the keys are the metrics names and the values are the scores.
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}
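
`compute_metrics1` averages precision, recall, and F1 with `"macro"`, whereas `compute_metrics` uses `"weighted"`. The two averages can diverge when the classes are imbalanced. A minimal sketch on made-up labels (the toy lists below are assumptions, not data from this notebook), using scikit-learn's `f1_score`:

```python
from sklearn.metrics import f1_score

# Toy labels (hypothetical): class 0 is twice as frequent as class 1
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

# Macro: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(y_true, y_pred, average="macro")
# Weighted: per-class F1 scores weighted by the number of true samples per class
weighted_f1 = f1_score(y_true, y_pred, average="weighted")

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

Here the minority class is predicted less well, so the macro average comes out lower than the weighted one. Which average to report depends on whether every field should count equally or in proportion to its size.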

# Loading Data

In [None]:
dataset = load_dataset('AlexanderBenady/generated_lectures')


In [None]:
all_data = pd.DataFrame(dataset["train"])
all_data.head()

| Unnamed: 0 | Field | Topic | Lecture Topic | Lecture | Summary |
|---|---|---|---|---|---|
| 0 | Social Sciences | Philosophy | The Foundations of Western Philosophy: Explori... | Today, we are delving into the captivating wor... | The lecture explores Pre-Socratic philosophy, ... |
| 1 | Social Sciences | Philosophy | Plato's Theory of Forms: Ideals and Realities | Today, we delve into one of the most fascinati... | The lecture explores Plato's Theory of Forms, ... |
| 2 | Social Sciences | Philosophy | Aristotle's Virtue Ethics: The Golden Mean | Hello everyone, today we are diving into the f... | The lecture explores Aristotle's Virtue Ethics... |
| 3 | Social Sciences | Philosophy | Stoicism and its Relevance in Modern Life | Welcome, everyone. Today, we are going to delv... | The lecture explores Stoicism, an ancient phil... |
| 4 | Social Sciences | Philosophy | Eastern Philosophies: Daoism and its Conceptio... | Welcome, everyone. Today, we are going to dive... | The lecture explores the Eastern philosophy of... |


In [None]:
# Replace newlines with a space so that adjacent sentences are not glued together
all_data = all_data.replace('\n', ' ', regex=True)

In [None]:
topics = all_data['Field'].unique()  # Extract the unique fields
topic_to_id = {topic: idx for idx, topic in enumerate(topics)}  # Map each field to an integer id

# Apply the mapping to the 'Field' column to create a new 'label' column
all_data['label'] = all_data['Field'].map(topic_to_id)
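
The integer ids produced by `topic_to_id` have to be turned back into field names at prediction time. A small sketch of building the inverse mapping, which could replace a hand-written dictionary; the `fields` list is a hypothetical stand-in for `all_data['Field'].unique()`, and `id_to_topic` is a name introduced here for illustration:

```python
# Hypothetical field names standing in for all_data['Field'].unique()
fields = ["Social Sciences", "Arts", "Natural Sciences"]

# Forward mapping, built the same way as in the cell above
topic_to_id = {topic: idx for idx, topic in enumerate(fields)}

# Inverse mapping: integer id back to the field name
id_to_topic = {idx: topic for topic, idx in topic_to_id.items()}

print(topic_to_id)
print(id_to_topic)
```

Deriving both dictionaries from the same list keeps the label order used for training and the one used for decoding predictions in sync.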

In [None]:
# Divide the data into 80% training, 10% validation, and 10% testing data
train_data, test_data, train_target, test_target = train_test_split(all_data['Summary'], all_data['label'], test_size=0.2, stratify=all_data['label'], random_state=42)
validation_data, test_data, validation_target, test_target = train_test_split(test_data, test_target, test_size=0.5, stratify=test_target, random_state=42)
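
The two-step split first holds out 20% of the data and then halves that part into validation and test sets. A quick sanity check of the resulting sizes and class balance, sketched on synthetic data (the `texts` and `labels` lists are assumptions, not the lecture dataset):

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 texts spread evenly over 5 labels
texts = [f"summary {i}" for i in range(100)]
labels = [i % 5 for i in range(100)]

# First split: 80% train, 20% held out
train_x, held_x, train_y, held_y = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)
# Second split: halve the held-out part into validation and test
val_x, test_x, val_y, test_y = train_test_split(
    held_x, held_y, test_size=0.5, stratify=held_y, random_state=42
)

print(len(train_x), len(val_x), len(test_x))
print(sorted(Counter(val_y).items()))
```

Because both calls pass `stratify`, each subset keeps the same label proportions as the full data.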

In [None]:
#Dataset into df
# Combine training data and target into a DataFrame
train_df = pd.DataFrame({
    'text': train_data,
    'label': train_target
})

# Combine validation data and target into a DataFrame
validation_df = pd.DataFrame({
    'text': validation_data,
    'label': validation_target
})

# Combine testing data and target into a DataFrame
test_df = pd.DataFrame({
    'text': test_data,
    'label': test_target
})

In [None]:
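# preserve_index=False avoids carrying the shuffled pandas index along as an extra column
hf_train_dataset = Dataset.from_pandas(train_df, preserve_index=False)
hf_validation_dataset = Dataset.from_pandas(validation_df, preserve_index=False)
hf_test_dataset = Dataset.from_pandas(test_df, preserve_index=False)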

# Model Training

In [None]:
model_checkpoint = "roberta-base"
batch_size = 32

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

In [None]:
# Apply the preprocessing function
tokenized_train_dataset = hf_train_dataset.map(preprocess_function, batched=True)
tokenized_validation_dataset = hf_validation_dataset.map(preprocess_function, batched=True)
tokenized_test_dataset = hf_test_dataset.map(preprocess_function, batched=True)

In [None]:
# One output class per field in the label mapping
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels=len(topic_to_id))

In [None]:
data_collator = DataCollatorWithPadding(tokenizer)

In [None]:
model.to(device)

In [None]:
# Set up training arguments
model_name = model_checkpoint.split("/")[-1]

args = TrainingArguments(
    output_dir=f"{model_name}-finetuned-classifier-roberta1",
    evaluation_strategy="epoch",
    learning_rate=2e-05,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.001,
    push_to_hub=False,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-08,
    lr_scheduler_type="linear"
)
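
With `lr_scheduler_type="linear"` and no warmup set in the arguments above, the learning rate should fall linearly from `2e-05` towards zero over the training run. A rough plain-Python sketch of that shape, not the library's own implementation; `linear_lr` and the `total_steps` value are hypothetical names and numbers chosen for illustration:

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Hypothetical step count; the real value is steps per epoch times num_train_epochs
total_steps = 100
for step in (0, 50, 100):
    print(step, linear_lr(step, total_steps))
```

With only two epochs, the rate is already at half its starting value by the end of the first epoch under this schedule.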

In [None]:
# Set up trainer
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_train_dataset,
    eval_dataset=tokenized_validation_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics1,
    data_collator = data_collator
)


In [None]:
trainer.train()

In [None]:
trainer.evaluate()

In [None]:
trainer.eval_dataset = tokenized_test_dataset
test_evaluated = trainer.evaluate()
test_evaluated

## Publish model to HuggingFace

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
trainer.push_to_hub("End of training")

# Final Classifier Model

This fine-tuned RoBERTa model achieves **92%** accuracy and a **92%** weighted F1 score on the test set.

In [None]:
model_name = "gserafico/roberta-base-finetuned-classifier-roberta1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [None]:
def classify_text(text):
    # Encode the text into tensor
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)

    # Predict using the model
    with torch.no_grad():
        outputs = model(**inputs)

    # Get the predicted class index
    predicted_class_idx = torch.argmax(outputs.logits, dim=1).item()

    # Map index to label; the order must match the topic_to_id mapping built during training
    labels = {
        0: 'Social Sciences',
        1: 'Arts',
        2: 'Natural Sciences',
        3: 'Business and Law',
        4: 'Engineering and Technology'
    }
    return labels[predicted_class_idx]
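
`classify_text` returns only the top label. When a confidence score is also useful, the logits can be passed through a softmax first. A minimal sketch, assuming logits of shape `(1, 5)` like `outputs.logits` for a single text; `label_with_confidence` and the example logits are made up for illustration:

```python
import torch

# Same index-to-field mapping as in classify_text
labels = {
    0: 'Social Sciences',
    1: 'Arts',
    2: 'Natural Sciences',
    3: 'Business and Law',
    4: 'Engineering and Technology'
}

def label_with_confidence(logits):
    """Return the predicted field and its softmax probability for one example."""
    probs = torch.softmax(logits, dim=-1)
    confidence, idx = torch.max(probs, dim=-1)
    return labels[idx.item()], confidence.item()

# Made-up logits standing in for outputs.logits (shape: 1 x 5)
example_logits = torch.tensor([[0.1, 2.0, -1.0, 0.3, 0.5]])
field, confidence = label_with_confidence(example_logits)
print(field, round(confidence, 3))
```

Softmax probabilities from a fine-tuned classifier are not necessarily well calibrated, so the score is best read as a relative indication rather than a true probability.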

In [None]:
# Adjacent string literals inside the parentheses are concatenated into one string
new_text2 = (
    "Generative AI refers to artificial intelligence technologies capable of creating content, such as text, images, videos, and music, that resemble human-like artifacts. A lecture on this topic might cover several core areas including the technology's fundamentals, applications, ethical implications, and future potential. "
    "Starting with the fundamentals, generative AI leverages deep learning models, particularly generative adversarial networks (GANs) and variational autoencoders (VAEs). These models learn to produce new data points indistinguishable from real data by training on large datasets. The lecture might delve into how GANs involve two competing neural networks, a generator and a discriminator, where the generator learns to make fake data and the discriminator learns to distinguish fake from real data. "
    "Applications of generative AI are extensive and transformative across various sectors. In art, algorithms like DALL-E create compelling images based on textual descriptions. In journalism, tools like GPT (Generative Pre-trained Transformer) automate content creation, significantly speeding up the writing process. In the field of medicine, generative models help design new molecules for drug development. "
    "Ethical considerations are critical in discussions about generative AI. The technology raises concerns regarding privacy, misinformation, and copyright issues as the line between real and AI-generated content blurs. The lecture might explore scenarios such as deepfakes influencing public opinion or AI inappropriately using copyrighted material. "
    "Looking ahead, the future of generative AI seems promising yet challenging. Continuous advancements are likely to enhance creative potential and problem-solving capabilities. However, managing the societal impacts and ethical challenges will be crucial for harnessing the benefits of generative AI while mitigating risks. "
    "The lecture would emphasize the importance of interdisciplinary efforts in ensuring responsible development and deployment of these technologies."
)

In [None]:
print('This summarized lecture is within the field of',classify_text(new_text2))