# 1. Fake news classifier

**A text classification model to detect fake news articles!**

* I downloaded the dataset from here: https://www.kaggle.com/datasets/sadikaljarif/fake-news-detection-dataset-english
* I developed an NLP classification model that uses a pretrained language model (`bert-base-uncased`) and the *text* of each article.
* I fine-tuned the language model on the dataset and computed the ROC AUC score of the model on the test set.
* I also [uploaded the model to the Hugging Face Hub](https://huggingface.co/h-pal/bert-fake-news-classification-fine-tuned).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.27.1-py3-none-any.whl (6.7 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.2-py3-none-any.whl (199 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.2 tokenizers-0.13.2 transformers-4.27.1


In [None]:
import pandas as pd
from sklearn.metrics import roc_auc_score
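from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup
import torch
# torch's AdamW replaces the deprecated transformers.AdamW (its default weight_decay is 0.01, not 0.0)
from torch.optim import AdamW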
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
df_fake = pd.read_csv('/content/drive/MyDrive/fake-news-data/Fake.csv/Fake.csv')
df_true = pd.read_csv('/content/drive/MyDrive/fake-news-data/True.csv/True.csv')

*   `1`: fake news
*   `0`: true news
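
With this label convention (1 = fake, 0 = true), index 1 of the softmax over the model's two logits is the probability that an article is fake. Here is a minimal plain-Python sketch of that mapping. The names `LABEL_NAMES`, `softmax`, and `predict_label` and the 0.5 threshold are assumptions for illustration, not part of the notebook.

```python
import math

# Label convention used in this notebook: 1 = fake, 0 = true
LABEL_NAMES = {0: "true", 1: "fake"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits, threshold=0.5):
    """Return (label name, probability of 'fake') for one pair of logits."""
    p_fake = softmax(logits)[1]
    label = LABEL_NAMES[1] if p_fake >= threshold else LABEL_NAMES[0]
    return label, p_fake

# Hypothetical logits, not real model output
print(predict_label([2.0, -1.0]))
print(predict_label([-1.0, 3.0]))
```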



In [None]:
df_fake['label'] = 1
df_true['label'] = 0

In [None]:
df = pd.concat([df_fake, df_true], axis=0)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)  # fixed seed so the shuffle is reproducible

In [None]:
df.isna().sum()

title      0
text       0
subject    0
date       0
label      0
dtype: int64

In [None]:
train_df, test_df = train_test_split(df, test_size=0.15, random_state=42)

print('Train set shape:', train_df.shape)
print('Test set shape:', test_df.shape)

Train set shape: (38163, 5)
Test set shape: (6735, 5)
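
The shapes above can be checked by hand. As far as I know, when `test_size` is a float, `train_test_split` rounds the test set up and gives the remainder to the training set. A small arithmetic sketch under that assumption:

```python
import math

n_samples = 38163 + 6735   # total rows, taken from the shapes printed above
test_size = 0.15

n_test = math.ceil(test_size * n_samples)   # test set is rounded up
n_train = n_samples - n_test                # the rest goes to training

print(n_train, n_test)  # 38163 6735
```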


In [None]:
x_train, y_train = train_df['text'], train_df['label']
x_test, y_test = test_df['text'], test_df['label']

In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
train_encodings = tokenizer(list(x_train), truncation=True, padding=True, max_length=512)
test_encodings = tokenizer(list(x_test), truncation=True, padding=True, max_length=512)

train_labels = np.array(list(y_train))
test_labels = np.array(list(y_test))

# Convert the data to PyTorch tensors
train_dataset = torch.utils.data.TensorDataset(torch.tensor(train_encodings['input_ids']),
                                               torch.tensor(train_encodings['attention_mask']),
                                               torch.tensor(train_labels))

test_dataset = torch.utils.data.TensorDataset(torch.tensor(test_encodings['input_ids']),
                                              torch.tensor(test_encodings['attention_mask']),
                                              torch.tensor(test_labels))
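
To make `truncation=True, padding=True, max_length=512` more concrete: each article is cut to at most `max_length` token IDs, shorter ones are padded to the longest sequence in the call, and the attention mask marks real tokens with 1 and padding with 0. The sketch below is a simplified plain-Python illustration. `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and it ignores details of the real tokenizer such as keeping the final `[SEP]` token when truncating.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad all to the longest one.

    pad_id=0 matches BERT's padding index. Returns (input_ids, attention_mask).
    """
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    attention_mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in truncated]
    return input_ids, attention_mask

# Two made-up token ID sequences of different lengths
ids, mask = pad_and_truncate([[101, 7, 102], [101, 7, 8, 9, 102]], max_length=4)
print(ids)   # [[101, 7, 102, 0], [101, 7, 8, 9]]
print(mask)  # [[1, 1, 1, 0], [1, 1, 1, 1]]
```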

In [None]:
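# Set up the optimizer and learning rate scheduler
optimizer = AdamW(model.parameters(), lr=2e-5, eps=1e-8)
# scheduler.step() runs once per batch, so count batches (batch size 8, 3 epochs), not samples
batch_size, num_epochs = 8, 3
total_steps = -(-len(train_dataset) // batch_size) * num_epochs  # ceiling division
scheduler = get_linear_schedule_with_warmup(optimizer, num_warmup_steps=0, num_training_steps=total_steps)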
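
`get_linear_schedule_with_warmup` scales the base learning rate by a factor that rises linearly during warmup and then falls linearly to zero at `num_training_steps`. Because `scheduler.step()` is called once per batch, those steps are optimizer updates rather than samples. The sketch below is a plain-Python approximation of that multiplier. The function name `linear_schedule_multiplier` is mine, not part of `transformers`. It uses the training-set size, batch size, and epoch count from this notebook.

```python
def linear_schedule_multiplier(step, num_warmup_steps, num_training_steps):
    """Learning-rate factor: linear warmup, then linear decay to zero."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

samples, batch_size, epochs = 38163, 8, 3
steps_per_epoch = -(-samples // batch_size)   # ceiling division: batches per epoch
total_steps = steps_per_epoch * epochs        # one scheduler step per batch
base_lr = 2e-5

for step in (0, total_steps // 2, total_steps):
    lr = base_lr * linear_schedule_multiplier(step, 0, total_steps)
    print(step, lr)
```

With these numbers there are 4771 batches per epoch and 14313 steps in total, and the learning rate reaches zero on the last step.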



In [None]:
# Set up the device (GPU or CPU)
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
model.to(device)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, element

In [None]:
# Train the model
batch_size = 8
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
for epoch in range(3):
    print('Epoch {}/{}'.format(epoch + 1, 3))
    print('-' * 10)
    total_loss = 0
    model.train()

    for batch in train_loader:
        batch = tuple(t.to(device) for t in batch)
        inputs = {'input_ids': batch[0],
                  'attention_mask': batch[1],
                  'labels': batch[2]}
        optimizer.zero_grad()
        outputs = model(**inputs)
        loss = outputs[0]
        total_loss += loss.item()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    print('Average Training Loss: {:.4f}'.format(total_loss / len(train_loader)))

Epoch 1/3
----------
Average Training Loss: 0.0093
Epoch 2/3
----------
Average Training Loss: 0.0035
Epoch 3/3
----------
Average Training Loss: 0.0017
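
The training loop calls `torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)` before every optimizer step. Roughly, this treats all gradients as one long vector and rescales it when its L2 norm exceeds the limit. Here is a simplified plain-Python sketch of that idea, using lists instead of tensors. `clip_by_global_norm` is a hypothetical helper name and the gradient values are made up.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a list of gradient lists in place; return the norm before clipping."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Scale down only when the norm exceeds max_norm; never scale up
    scale = min(1.0, max_norm / (total_norm + eps))
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= scale
    return total_norm

grads = [[3.0, 4.0], [0.0]]   # global L2 norm is 5.0
norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in grads for g in grad))
print(norm_before, norm_after)
```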


In [None]:
# Evaluate the model
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=False)  # no need to shuffle for evaluation
test_probs = []
test_labels = []

model.eval()
for batch in test_loader:
    batch = tuple(t.to(device) for t in batch)
    inputs = {'input_ids': batch[0],
              'attention_mask': batch[1]}
    labels = batch[2]
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs[0]
    probs = torch.softmax(logits, dim=1)
    test_probs.extend(probs[:, 1].cpu().numpy().tolist())
    test_labels.extend(labels.cpu().numpy().tolist())

auc_score = roc_auc_score(test_labels, test_probs)
print('AUC Score:', auc_score)

AUC Score: 1.0
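
For intuition about what `roc_auc_score` reports: ROC AUC equals the probability that a randomly chosen fake article gets a higher score than a randomly chosen true one, with ties counted as half. So a score of 1.0 means every fake article in the test set was ranked above every true one. Here is a minimal pairwise sketch in plain Python. `pairwise_auc` is a hypothetical helper and the toy labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))   # 1.0
```

The double loop is quadratic in the number of examples, so it is only meant for illustration on small inputs.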


In [None]:
model.save_pretrained('fine-tuned-model')
tokenizer.save_pretrained('fine-tuned-tokenizer')

('fine-tuned-tokenizer/tokenizer_config.json',
 'fine-tuned-tokenizer/special_tokens_map.json',
 'fine-tuned-tokenizer/vocab.txt',
 'fine-tuned-tokenizer/added_tokens.json')

In [None]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
!huggingface-cli repo create bert-fake-news-classification-fine-tuned

git version 2.25.1
git-lfs/2.9.2 (GitHub; linux amd64; go 1.13.5)

You are about to create h-pal/bert-fake-news-classification-fine-tuned
Proceed? [Y/n] Y

Your repo now lives at:
  https://huggingface.co/h-pal/bert-fake-news-classification-fine-tuned

You can clone it locally with the command below, and commit/push as usual.

  git clone https://huggingface.co/h-pal/bert-fake-news-classification-fine-tuned



In [None]:
model.push_to_hub('bert-fake-news-classification-fine-tuned')
tokenizer.push_to_hub('bert-fake-news-classification-fine-tuned')

CommitInfo(commit_url='https://huggingface.co/h-pal/bert-fake-news-classification-fine-tuned/commit/d6d35288a300113a40f61044bae11631c9b3f8b7', commit_message='Upload tokenizer', commit_description='', oid='d6d35288a300113a40f61044bae11631c9b3f8b7', pr_url=None, pr_revision=None, pr_num=None)

Link to my fine-tuned model: https://huggingface.co/h-pal/bert-fake-news-classification-fine-tuned