**1. Install Necessary Libraries**

First, install the libraries the project depends on: transformers, datasets, evaluate, gradio, youtube-transcript-api, and pydub. Note that youtube-transcript-api and pydub are installed here but are not imported in the sections below.

In [2]:
!pip install transformers datasets evaluate gradio youtube-transcript-api pydub

**2. Import Libraries**

Next, import the libraries that we will use throughout the project.

In [3]:
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset
import evaluate
import gradio as gr


**3. Load the Dataset**

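Load the dataset from a CSV file. The file contains two columns: URL and Label. When a single file is passed through data_files, load_dataset places all rows in a split named train, which is why later cells access dataset['train'].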

In [4]:
dataset = load_dataset('csv', data_files='dataset.csv')
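
As an illustration of the layout this step expects, the sketch below builds a two-row CSV in memory and reads it back. The contents of the real dataset.csv are not shown in this notebook, so the rows, the placeholder URLs, and the label strings "positive" and "negative" are assumptions made for the example.

```python
import csv
import io

# Hypothetical rows: the URLs and label strings are placeholders,
# not values taken from the actual dataset.csv.
sample_csv = io.StringIO()
writer = csv.DictWriter(sample_csv, fieldnames=["URL", "Label"])
writer.writeheader()
writer.writerow({"URL": "https://www.youtube.com/watch?v=VIDEO_ID_1", "Label": "positive"})
writer.writerow({"URL": "https://www.youtube.com/watch?v=VIDEO_ID_2", "Label": "negative"})

# Read the rows back as dictionaries keyed by the header names.
sample_csv.seek(0)
rows = list(csv.DictReader(sample_csv))
print(rows[0]["URL"], rows[0]["Label"])
```

A file with this header row should load with the 'csv' builder and expose the URL and Label columns used in the preprocessing step.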


**4. Load the Tokenizer**

We need to load the tokenizer for the T5 model, which will help in converting our text data into tokens that the model can process.

In [9]:
tokenizer = T5Tokenizer.from_pretrained("t5-small")




**5. Preprocess the Dataset**

We will preprocess the dataset by tokenizing the input URLs and labels. This step is necessary to convert the text data into a format that the T5 model can understand.

In [10]:
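def preprocess_function(examples):
    inputs = examples['URL']
    targets = examples['Label']
    model_inputs = tokenizer(inputs, max_length=512, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=512, truncation=True, padding="max_length")
    # Replace padding ids in the labels with -100 so the loss ignores padded positions
    model_inputs['labels'] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in labels['input_ids']
    ]
    return model_inputs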

train_dataset = dataset['train'].map(preprocess_function, batched=True)
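
Labels padded to a fixed length are commonly masked so that padded positions do not contribute to the loss. The sketch below shows that masking on its own, using made-up token ids. It relies on two facts about T5: the padding token id is 0 and the end-of-sequence id is 1. The helper name mask_padding is hypothetical.

```python
PAD_TOKEN_ID = 0      # T5's padding token id
IGNORE_INDEX = -100   # label value that the cross-entropy loss skips

def mask_padding(label_ids, pad_token_id=PAD_TOKEN_ID):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [tok if tok != pad_token_id else IGNORE_INDEX for tok in seq]
        for seq in label_ids
    ]

# Made-up ids: one content token, the end-of-sequence id (1), then padding.
padded = [[1465, 1, 0, 0], [2841, 1, 0, 0]]
masked = mask_padding(padded)
print(masked)  # [[1465, 1, -100, -100], [2841, 1, -100, -100]]
```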


**6. Load the Model**

We will load the T5 model that we want to fine-tune on our dataset.

In [11]:
model = T5ForConditionalGeneration.from_pretrained("t5-small")


**7. Define the Evaluation Metric**

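To evaluate the model's performance, we will use the accuracy metric from the evaluate library. Because T5 produces text rather than class indices, accuracy here means the share of predictions whose decoded text exactly matches the decoded label.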

In [12]:
accuracy_metric = evaluate.load("accuracy")


**8. Define the Compute Metrics Function**

We will define a function to compute the accuracy of our model. This function will be used by the Trainer during the training process.

In [13]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # For T5 the Trainer may return a tuple; the first element holds the logits
    if isinstance(predictions, tuple):
        predictions = predictions[0]
    # Convert logits to token ids and restore padding where labels were masked with -100
    pred_ids = predictions.argmax(axis=-1)
    labels = labels.copy()
    labels[labels == -100] = tokenizer.pad_token_id
    # Ignore predicted tokens at padded label positions
    pred_ids[labels == tokenizer.pad_token_id] = tokenizer.pad_token_id

    decoded_preds = [p.strip() for p in tokenizer.batch_decode(pred_ids, skip_special_tokens=True)]
    decoded_labels = [l.strip() for l in tokenizer.batch_decode(labels, skip_special_tokens=True)]

    # The accuracy metric expects integer classes, so score each exact text match as 1
    matches = [int(pred == label) for pred, label in zip(decoded_preds, decoded_labels)]
    return accuracy_metric.compute(predictions=matches, references=[1] * len(matches))
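
To make the scoring rule concrete, here is a minimal sketch of exact-match accuracy over decoded strings, independent of the tokenizer and the Trainer. The function name exact_match_accuracy and the sample strings are assumptions for illustration.

```python
def exact_match_accuracy(predictions, references):
    """Return the fraction of predictions equal to their reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return matches / len(references)

# Hypothetical decoded outputs and labels.
preds = ["positive", " negative", "neutral", "positive"]
refs = ["positive", "negative", "negative", "positive"]
score = exact_match_accuracy(preds, refs)
print(score)  # 0.75
```

Three of the four predictions match their labels once surrounding whitespace is removed, so the score is 0.75.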


**9. Set Up Training Arguments**

We will define the training arguments, which specify the hyperparameters for the training process, such as the number of epochs, batch size, and learning rate.

In [14]:
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)
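
For a sense of how these settings translate into work, the sketch below estimates the number of optimizer steps. It assumes a single device, no gradient accumulation, and that the final partial batch is kept. The dataset size of 1000 rows is a made-up figure, since the real size is not stated in this notebook.

```python
import math

def total_training_steps(num_examples, batch_size, num_epochs):
    """Estimate optimizer steps for one device without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs

# Hypothetical dataset size; batch size and epochs match the arguments above.
steps = total_training_steps(num_examples=1000, batch_size=8, num_epochs=3)
print(steps)  # 375
```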

**10. Initialize the Trainer**

Now, we initialize the Trainer with our model, training arguments, training dataset, and the compute metrics function.

In [15]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=train_dataset,
    compute_metrics=compute_metrics
)
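
The Trainer above evaluates on the same rows it trains on. If a held-out estimate is wanted instead, one option is to split the rows first. The sketch below shows a seeded index split in plain Python. The helper split_indices and the 20% fraction are assumptions, and the datasets library offers Dataset.train_test_split for the same purpose.

```python
import random

def split_indices(num_examples, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed and return (train, eval) index lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    num_test = max(1, int(num_examples * test_fraction))
    return indices[num_test:], indices[:num_test]

# Ten hypothetical rows: eight go to training, two are held out.
train_idx, eval_idx = split_indices(10)
print(len(train_idx), len(eval_idx))  # 8 2
```

The resulting index lists could be passed to Dataset.select to build separate training and evaluation datasets.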


**11. Train the Model**

We can now start training. Calling trainer.train() fine-tunes the T5 model on our dataset using the arguments defined above.

In [1]:
trainer.train()


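NameError: name 'trainer' is not defined

Note: this error appears because the cell was executed in a fresh session (its execution count is 1), before the Trainer had been created. Run the cells in sections 1 to 10 first, then rerun this cell.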

**12. Evaluate the Model**

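After training, we evaluate the model. Because the Trainer was given the training dataset as its evaluation set, the reported accuracy measures how well the model fits data it has already seen, not how well it generalizes. The metric is returned as a fraction under the key eval_accuracy, so we multiply it by 100 to print a percentage.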

In [1]:
results = trainer.evaluate()
accuracy = results['eval_accuracy'] * 100
print(f"Model Accuracy: {accuracy:.2f}%")


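NameError: name 'trainer' is not defined

Note: trainer is undefined here because this cell was also executed in a fresh session. Run the earlier cells, including training, before evaluating.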

**13. Set Up Gradio Interface**

Gradio provides an easy way to create interactive web interfaces for your models. We will create an interface that accepts a YouTube URL and returns the sentiment label generated by our fine-tuned T5 model. As in training, the model receives the URL string itself as its input.

In [None]:
def sentiment_analysis(url):
    # Tokenize the input URL and move the tensors to the model's device
    # (after training the model may be on a GPU while new tensors default to CPU)
    inputs = tokenizer(url, return_tensors="pt", max_length=512, truncation=True).to(model.device)
    # Generate the prediction
    outputs = model.generate(**inputs)
    # Decode the prediction
    sentiment = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return sentiment

# Create the Gradio interface
iface = gr.Interface(
    fn=sentiment_analysis,
    inputs="text",
    outputs="text",
    title="YouTube Video Sentiment Analysis",
    description="Enter a YouTube video URL to get the sentiment analysis result."
)

# Launch the interface
iface.launch()
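
Since the interface accepts free text, it may help to check that the input looks like a YouTube URL before passing it to the model. The sketch below is one possible check using only the standard library. The helper extract_video_id and the list of accepted hostnames are assumptions, not part of the notebook above.

```python
from urllib.parse import urlparse, parse_qs

def extract_video_id(url):
    """Return the video id from a YouTube URL, or None if the URL is not recognized."""
    parsed = urlparse(url.strip())
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # Short links carry the id in the path.
        return parsed.path.lstrip("/") or None
    if host in ("youtube.com", "www.youtube.com", "m.youtube.com"):
        # Watch links carry the id in the "v" query parameter.
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None

print(extract_video_id("https://www.youtube.com/watch?v=abc123"))  # abc123
print(extract_video_id("https://example.com/watch?v=abc123"))      # None
```

A wrapper around sentiment_analysis could return a short error message when this helper yields None instead of calling the model.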
