# Sentiment Analysis of News Headlines 📰

## Overview

### Text Classification with DistilBERT



1. **Dataset Creation**: We start by creating a custom dataset called `dataset.json`. This dataset contains news headlines along with their corresponding sentiment labels (positive or negative). The data was obtained by scraping news websites using the Beautiful Soup library.

2. **Model Finetuning**: We then finetune the pre-trained DistilBERT model on the `dataset.json` dataset. Finetuning involves adapting the model's parameters to better fit the specific task of sentiment classification. This step is crucial for achieving high performance on our custom task.

3. **Inference**: After finetuning, we use the trained model to make predictions on new, unseen news headlines. Given a news headline, the model predicts whether it expresses a positive or negative sentiment.


## Objective
The objective of this project is to demonstrate how to finetune the DistilBERT model on a custom dataset for text classification. Specifically, we aim to classify news headlines as either positive or negative based on their sentiment.

Text classification is a fundamental Natural Language Processing (NLP) task that involves assigning a label or class to a piece of text. This project focuses on text classification using the DistilBERT model, a distilled version of the BERT model that retains much of its performance while being faster and more memory-efficient.


## Further developments

1. It was a little hard to find good news on the internet, which made me think about running a data analysis of how many bad news stories there are for each good one.

The next step is to gather more data and run this analysis with the sentiment model fine-tuned in this notebook.

Other resources to add to the dataset for GOOD NEWS:

  1. https://www.positive.news/
  2. https://www.today.com/news/good-news
  3. https://www.foxnews.com/category/good-news
  4. https://www.rescue.org/uk/hopeful-news?gad_source=1&gclid=Cj0KCQjw6auyBhDzARIsALIo6v8o-OBraUZ7ZCD8QgbdhEeYq60SURXZcN_bTfCTeZIw3g_VGmBg-EcaAmnuEALw_wcB

2. Another idea is to create an app that scrapes the internet and returns at least one piece of good news to its users, so they have at least one cheerful story to keep hope alive. :)

3. Questions this inspired: how can we build a better society by focusing on good news? If bad news increases anxiety and depression, could watching only good news become a kind of prescription for those who suffer from these mental health problems?

## Resources about the impact of news on our brain and behaviour

1. [Negative news dominates fast and slow brain responses and social judgments even after source credibility evaluation](https://www.sciencedirect.com/science/article/pii/S1053811921008454)

2. [Protecting the brain against bad news](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8096381/)

3. [Is Watching the News Bad for Mental Health?](https://www.verywellmind.com/is-watching-the-news-bad-for-mental-health-4802320#:~:text=Consuming%20the%20news%20can%20activate,says%20physical%20symptoms%20may%20arise.)

4. [The Good News Effect](https://www.washingtonpost.com/brand-studio/mikes/the-good-news-effect/)

5. [How the news changes the way we think and behave](https://www.bbc.com/future/article/20200512-how-the-news-changes-the-way-we-think-and-behave)

6. [Media overload is hurting our mental health. Here are ways to manage headline stress](https://www.apa.org/monitor/2022/11/strain-media-overload)




# Project Development

In [None]:
! pip install transformers datasets evaluate --quiet

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
from datasets import DatasetDict, Dataset

# Load JSON files
train_file = "train_dataset.json"
test_file = "test_dataset.json"

# Load datasets
train_data = Dataset.from_json(train_file)
test_data = Dataset.from_json(test_file)

In [None]:
# Combine the splits into a DatasetDict
# so they can be uploaded to the Hugging Face Hub later

headline_news = DatasetDict({"train": train_data, "test": test_data})

In [None]:
headline_news

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 192
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 48
    })
})

In [None]:
headline_news["train"][0]

{'label': 0, 'text': 'Play video: What is fuelling the far right in Germany?'}
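The notebook does not show how `train_dataset.json` and `test_dataset.json` were produced. As a rough sketch, assuming the scraped headlines were stored as JSON Lines records with the `label` and `text` fields shown above, an 80/20 split (which would match the 192/48 row counts) could look like this. The helper names `split_records` and `write_json_lines`, the seed, and the placeholder headlines are all hypothetical.

```python
import json
import random


def split_records(records, test_fraction=0.2, seed=42):
    """Shuffle records and split them into (train, test) lists."""
    shuffled = list(records)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)     # seeded for a reproducible split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def write_json_lines(records, path):
    """Write one JSON object per line, a layout that Dataset.from_json can read."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Placeholder records, not real scraped data: label 0 = negative, 1 = positive.
records = [{"label": i % 2, "text": f"placeholder headline {i}"} for i in range(240)]
train_records, test_records = split_records(records)
print(len(train_records), len(test_records))
```

Calling `write_json_lines(train_records, "train_dataset.json")` and the same for the test split would then produce files in the shape loaded earlier.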

## Preprocess

The next step is to load a DistilBERT tokenizer to preprocess the `text` field:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

Create a preprocessing function to tokenize `text` and truncate sequences to be no longer than DistilBERT's maximum input length:

In [None]:
def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True)

To apply the preprocessing function over the entire dataset, use 🤗 Datasets [map](https://huggingface.co/docs/datasets/main/en/package_reference/main_classes#datasets.Dataset.map) function. You can speed up `map` by setting `batched=True` to process multiple elements of the dataset at once:

In [None]:
tokenized_news = headline_news.map(preprocess_function, batched=True)

In [None]:
tokenized_news

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 192
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'attention_mask'],
        num_rows: 48
    })
})

Now create a batch of examples using DataCollatorWithPadding. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

In [None]:
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
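To make the benefit of dynamic padding concrete, here is a minimal pure-Python sketch of the idea. It is not the collator's implementation. The helper `pad_batch` and the token ids are made up for illustration, and the pad id of 0 and the 512-token maximum are assumptions about the `distilbert-base-uncased` tokenizer.

```python
def pad_batch(sequences, pad_id=0, target_length=None):
    """Right-pad token-id lists and build the matching attention masks."""
    if target_length is None:
        # Dynamic padding: only pad up to the longest sequence in this batch.
        target_length = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 0 marks padding
    return input_ids, attention_mask


# Made-up token ids for three short headlines (assumed: 101 = [CLS], 102 = [SEP]).
batch = [[101, 2054, 102], [101, 2204, 2739, 2651, 102], [101, 102]]

dynamic_ids, dynamic_mask = pad_batch(batch)
fixed_ids, _ = pad_batch(batch, target_length=512)

print(len(dynamic_ids[0]))   # length of the longest sequence in the batch
print(len(fixed_ids[0]))     # fixed maximum length
```

For short texts such as headlines, the dynamically padded batch is far smaller than one padded to the fixed maximum, so the model processes fewer padding tokens.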

## Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [accuracy](https://huggingface.co/spaces/evaluate-metric/accuracy) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):

In [None]:
import evaluate

accuracy = evaluate.load("accuracy")


Then create a function that passes your predictions and labels to `compute` to calculate the accuracy:

In [None]:
import numpy as np


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=labels)


The `compute_metrics` function is ready to go now, and we'll return to it when we set up the training.
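For intuition, the calculation inside `compute_metrics` can be reproduced without any library. The sketch below is an illustration, not the Evaluate implementation. The function names and the logits are made up, with columns assumed to be ordered `[NEGATIVE, POSITIVE]`.

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Made-up logits for four headlines; columns are [NEGATIVE, POSITIVE].
example_logits = [[0.3, -0.1], [-0.2, 0.4], [0.1, 0.0], [-0.5, 0.2]]
example_labels = [0, 1, 1, 1]

result = accuracy_from_logits(example_logits, example_labels)
print(result)   # three of the four predictions match the labels
```

With only 48 test headlines, each misclassified headline moves the accuracy by about two percentage points, which is worth keeping in mind when comparing runs.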

## Train

Before you start training your model, create a map of the expected ids to their labels with `id2label` and `label2id`:

In [None]:
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

In [None]:
len(tokenized_news["train"])

192

In [None]:
from transformers import create_optimizer
import tensorflow as tf

batch_size = 5
num_epochs = 3
batches_per_epoch = len(tokenized_news["train"]) // batch_size
total_train_steps = int(batches_per_epoch * num_epochs)
optimizer, schedule = create_optimizer(init_lr=2e-5, num_warmup_steps=0, num_train_steps=total_train_steps)

In [None]:
optimizer

<tf_keras.src.optimizers.adam.Adam at 0x7ebcd64b7520>
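The step arithmetic above can be checked by hand. The sketch below also shows what the learning rate might look like over training, assuming that `create_optimizer` with `num_warmup_steps=0` uses a linear decay from `init_lr` to zero. That is our reading of its defaults, not something the notebook verifies. The function `linear_decay_lr` is a hypothetical stand-in, not the library's schedule object.

```python
def linear_decay_lr(step, init_lr=2e-5, total_steps=114, end_lr=0.0):
    """Learning rate at `step` under a linear decay from init_lr to end_lr."""
    step = min(step, total_steps)            # hold at end_lr once training is done
    remaining = 1.0 - step / total_steps     # fraction of training still to go
    return (init_lr - end_lr) * remaining + end_lr


batch_size = 5
num_epochs = 3
num_train_rows = 192

batches_per_epoch = num_train_rows // batch_size      # 192 // 5 = 38
total_train_steps = batches_per_epoch * num_epochs    # 38 * 3 = 114

for step in (0, total_train_steps // 2, total_train_steps):
    print(step, linear_decay_lr(step, total_steps=total_train_steps))
```

Under that assumption the rate starts at 2e-5, is halved by step 57, and reaches zero at step 114, so the number of epochs passed to `fit` should match `num_epochs` for the schedule to end when training does.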

Then you can load DistilBERT with [TFAutoModelForSequenceClassification](https://huggingface.co/docs/transformers/main/en/model_doc/auto#transformers.TFAutoModelForSequenceClassification) along with the number of expected labels, and the label mappings:

In [None]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2, id2label=id2label, label2id=label2id
)

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFDistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
Some weights or buffers of the TF 2.0 model TFDistilBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classifier.weight', 'classifier.bias']
You should 

Convert your datasets to the `tf.data.Dataset` format with [prepare_tf_dataset()](https://huggingface.co/docs/transformers/main/en/main_classes/model#transformers.TFPreTrainedModel.prepare_tf_dataset):

In [None]:
tf_train_set = model.prepare_tf_dataset(
    tokenized_news["train"],
    shuffle=True,
    batch_size=batch_size,
    collate_fn=data_collator,
)

tf_validation_set = model.prepare_tf_dataset(
    tokenized_news["test"],
    shuffle=False,
    batch_size=batch_size,
    collate_fn=data_collator,
)

In [None]:
# Print the number of batches in each set
# (len() of a batched tf.data.Dataset counts batches, not individual headlines)
print(f"Number of training batches: {len(tf_train_set)}")
print(f"Number of validation batches: {len(tf_validation_set)}")

Number of training batches: 38
Number of validation batches: 10
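The values 38 and 10 printed above count batches of size 5 rather than individual headlines. A small sketch of where they come from, assuming that `prepare_tf_dataset` drops the incomplete final batch when `shuffle=True` and keeps it when `shuffle=False` (our understanding of its default behaviour); the helper `num_batches` is hypothetical:

```python
import math


def num_batches(num_rows, batch_size, drop_remainder):
    """Number of batches a dataset of num_rows yields at the given batch size."""
    if drop_remainder:
        return num_rows // batch_size          # the incomplete last batch is discarded
    return math.ceil(num_rows / batch_size)    # the last batch may be smaller


train_batches = num_batches(192, 5, drop_remainder=True)
validation_batches = num_batches(48, 5, drop_remainder=False)
print(train_batches, validation_batches)
```

Under that assumption, 2 of the 192 training headlines are left out of each epoch, while all 48 test headlines are evaluated, the last batch holding only 3.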


Configure the model for training with [`compile`](https://keras.io/api/models/model_training_apis/#compile-method). Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

In [None]:
import tensorflow as tf

model.compile(optimizer=optimizer)

In Keras, compiling the model means configuring the learning process. It's where you define the optimizer, loss function, and metrics that you want to use. The optimizer is the algorithm that adjusts the weights of the network to minimize the loss function.

The last two things to set up before you start training are a way to compute the accuracy from the predictions and a way to push your model to the Hub. Both are done by using [Keras callbacks](https://huggingface.co/docs/transformers/main/en/tasks/../main_classes/keras_callbacks).

Pass your `compute_metrics` function to [KerasMetricCallback](https://huggingface.co/docs/transformers/main/en/main_classes/keras_callbacks#transformers.KerasMetricCallback):

In [None]:
from transformers.keras_callbacks import KerasMetricCallback

metric_callback = KerasMetricCallback(metric_fn=compute_metrics, eval_dataset=tf_validation_set)

In [None]:
from transformers.keras_callbacks import PushToHubCallback

push_to_hub_callback = PushToHubCallback(
    output_dir="good_news_detector",
    tokenizer=tokenizer,
)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/lmelos/good_news_detector into local empty directory.


In [None]:
callbacks = [metric_callback, push_to_hub_callback]

In [None]:
model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=callbacks)

Epoch 1/3


Cause: for/else statement not yet supported


Cause: for/else statement not yet supported
Epoch 2/3
Epoch 3/3


<tf_keras.src.callbacks.History at 0x7ebcd6243cd0>

## Inference

Great, now that we've finetuned a model, we can use it for inference!

Choose a headline you'd like to run inference on:

In [None]:
text = "Lima’s neurodivergent picnic movement is liberating Peruvians from stigma and abuse."

The simplest way to try out your finetuned model for inference is to use it in a [pipeline()](https://huggingface.co/docs/transformers/main/en/main_classes/pipelines#transformers.pipeline). Instantiate a `pipeline` for sentiment analysis with your model, and pass your text to it:


In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="lmelos/good_news_detector")
classifier(text)

Some layers from the model checkpoint at lmelos/good_news_detector were not used when initializing TFDistilBertForSequenceClassification: ['dropout_39']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at lmelos/good_news_detector and are newly initialized: ['dropout_79']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


[{'label': 'POSITIVE', 'score': 0.5502596497535706}]

You can also manually replicate the results of the `pipeline` if you'd like:

Tokenize the text and return TensorFlow tensors:

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("lmelos/good_news_detector")
inputs = tokenizer(text, return_tensors="tf")

Pass your inputs to the model and return the `logits`:

In [None]:
from transformers import TFAutoModelForSequenceClassification

model = TFAutoModelForSequenceClassification.from_pretrained("lmelos/good_news_detector")
logits = model(**inputs).logits

Get the class with the highest probability, and use the model's `id2label` mapping to convert it to a text label:

In [None]:
predicted_class_id = int(tf.math.argmax(logits, axis=-1)[0])
model.config.id2label[predicted_class_id]

'POSITIVE'
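The `score` reported by the pipeline is, as far as we understand, the softmax probability of the predicted class. The sketch below shows that step in plain Python, assuming a two-class softmax; the logits are made up and are not the model's actual output.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]   # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Made-up logits for one headline, not the model's actual output.
example_logits = [-0.1, 0.1]
probabilities = softmax(example_logits)
predicted_class_id = max(range(len(probabilities)), key=probabilities.__getitem__)
print({"label": id2label[predicted_class_id], "score": probabilities[predicted_class_id]})
```

A logit gap of only 0.2 already yields a score of roughly 0.55, so scores this close to 0.5 suggest the model separates the two classes only weakly.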

In [None]:
text_negative = "PM says 'day of shame' as he apologises to blood scandal victims"

In [None]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis", model="lmelos/good_news_detector")
classifier(text_negative)

[{'label': 'NEGATIVE', 'score': 0.5490089058876038}]

## Conclusion

By following this guide, you will learn how to leverage the power of the DistilBERT model for text classification tasks. Whether you're interested in sentiment analysis or other forms of text classification, the techniques demonstrated here serve as a solid foundation for building and deploying NLP models in real-world applications.



---


