<a href="https://colab.research.google.com/github/koleshjr/Fake_News_Classification/blob/main/nlp_climate_Modelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Can You Model Fake Climate Change Data?
Content

CLIMATE-FEVER is a dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate change collected on the internet. Each claim is accompanied by five manually annotated evidence sentences retrieved from the English Wikipedia that support, refute, or do not give enough information to validate the claim, for a total of 7,675 claim-evidence pairs. The dataset features challenging claims that relate to multiple facets, as well as disputed cases where both supporting and refuting evidence are present.

FIELDS

    claim_id: unique claim id
    claim: claim text
    claim_label: overall label assigned to claim (based on majority vote on evidences)
    evidences: top five evidence sentences
    evidence_id: unique evidence id
    evidence_label: micro-verdict label
    article: title of source article (Wikipedia page)
    evidence: evidence sentence
    entropy: entropy reflecting uncertainty of votes
    votes: array containing individual votes
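
The `entropy` field can be illustrated with a small sketch. It assumes the value is the Shannon entropy (natural logarithm) of the individual votes, which is consistent with the sample rows shown further down in this notebook (0.693147 for a one-to-one split, 0.0 for unanimous votes). The helper name `vote_entropy` is hypothetical and not part of the dataset tooling.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Shannon entropy (natural log) of a list of vote labels.

    Missing votes (None) are skipped. Assumption: this mirrors how the
    dataset's `entropy` column was computed; the source does not say so.
    """
    votes = [v for v in votes if v is not None]
    if not votes:
        return 0.0
    counts = Counter(votes)
    total = len(votes)
    # sum of p * log(1 / p) over the observed labels
    return sum((c / total) * math.log(total / c) for c in counts.values())


split = vote_entropy(["SUPPORTS", "NOT_ENOUGH_INFO", None, None, None])
unanimous = vote_entropy(["SUPPORTS", "SUPPORTS", None, None, None])
print(round(split, 6), round(unanimous, 6))  # 0.693147 0.0
```

A one-to-one split gives ln 2, the maximum for two votes, while full agreement gives zero.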


### LOAD NECESSARY LIBRARIES

In [None]:
!pip install -q datasets
!pip install transformers
!pip install optuna
!pip install sentencepiece
!pip install evaluate


!pip install mlflow wandb




In [None]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


In [None]:
from pathlib import Path
path = '/content/drive/MyDrive/fake_news_classifier/'

In [None]:
import os

import mlflow
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import transformers
import wandb
from datasets import Dataset, DatasetDict
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

In [None]:
!wandb login 
wandb.init(project="classify_climate_change_propaganda")

[34m[1mwandb[0m: Currently logged in as: [33mkoleshjr[0m ([33mteam-kolesh[0m). Use [1m`wandb login --relogin`[0m to force relogin


[34m[1mwandb[0m: Currently logged in as: [33mkoleshjr[0m ([33mteam-kolesh[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
# Log every trained model to W&B. The comment sits on its own line because
# %env treats an inline comment as part of the value.
%env WANDB_LOG_MODEL=true

env: WANDB_LOG_MODEL=true


In [None]:
transformers.set_seed(42)

In [None]:
df = pd.read_csv(path + 'climate-fever.csv')
df.head()

Unnamed: 0,claim_id,claim,claim_label,evidences/0/evidence_id,evidences/0/evidence_label,evidences/0/article,evidences/0/evidence,evidences/0/entropy,evidences/0/votes/0,evidences/0/votes/1,...,evidences/4/evidence_id,evidences/4/evidence_label,evidences/4/article,evidences/4/evidence,evidences/4/entropy,evidences/4/votes/0,evidences/4/votes/1,evidences/4/votes/2,evidences/4/votes/3,evidences/4/votes/4
0,0,Global warming is driving polar bears toward e...,SUPPORTS,Extinction risk from global warming:170,NOT_ENOUGH_INFO,Extinction risk from global warming,"""Recent Research Shows Human Activity Driving ...",0.693147,SUPPORTS,NOT_ENOUGH_INFO,...,Polar bear:1328,NOT_ENOUGH_INFO,Polar bear,"""Bear hunting caught in global warming debate"".",0.693147,SUPPORTS,NOT_ENOUGH_INFO,,,
1,5,The sun has gone into ‘lockdown’ which could c...,SUPPORTS,Famine:386,SUPPORTS,Famine,The current consensus of the scientific commun...,0.0,SUPPORTS,SUPPORTS,...,Winter:5,NOT_ENOUGH_INFO,Winter,"In many regions, winter is associated with sno...",0.693147,REFUTES,NOT_ENOUGH_INFO,,,
2,6,The polar bear population has been growing.,REFUTES,Polar bear:1332,NOT_ENOUGH_INFO,Polar bear,"""Ask the experts: Are polar bear populations i...",0.693147,NOT_ENOUGH_INFO,REFUTES,...,Polar bear:61,REFUTES,Polar bear,Of the 19 recognized polar bear subpopulations...,0.0,REFUTES,REFUTES,,,
3,9,Ironic' study finds more CO2 has slightly cool...,REFUTES,Atmosphere of Mars:131,NOT_ENOUGH_INFO,Atmosphere of Mars,CO2 in the mesosphere acts as a cooling agent ...,0.693147,NOT_ENOUGH_INFO,SUPPORTS,...,Carbon dioxide:191,NOT_ENOUGH_INFO,Carbon dioxide,"Less energy reaches the upper atmosphere, whic...",0.0,NOT_ENOUGH_INFO,NOT_ENOUGH_INFO,,,
4,10,Human additions of CO2 are in the margin of er...,REFUTES,Carbon dioxide in Earth's atmosphere:140,NOT_ENOUGH_INFO,Carbon dioxide in Earth's atmosphere,While CO 2 absorption and release is always ha...,0.693147,NOT_ENOUGH_INFO,REFUTES,...,Sea:226,REFUTES,Sea,"More recently, anthropogenic activities have s...",0.0,REFUTES,REFUTES,,,


Each row of the dataset holds one claim, its overall label (`claim_label`), and five evidence sentences, each with its own label, source article, entropy, and votes.
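
The wide layout can be unrolled into one row per claim-evidence pair. The sketch below assumes the column naming pattern `evidences/{i}/{field}` seen in the output above. The helper `evidences_to_long` and the toy frame are hypothetical, and the vote columns are left out for brevity.

```python
import pandas as pd

EVIDENCE_FIELDS = ["evidence_id", "evidence_label", "article", "evidence", "entropy"]


def evidences_to_long(df_wide, n_evidences=5):
    """Turn evidences/{i}/... columns into one row per (claim, evidence) pair."""
    frames = []
    for i in range(n_evidences):
        # Map e.g. "evidences/0/article" -> "article" for the i-th evidence
        cols = {f"evidences/{i}/{field}": field for field in EVIDENCE_FIELDS}
        part = df_wide[["claim_id", "claim", "claim_label", *cols]].rename(columns=cols)
        part.insert(3, "evidence_rank", i)
        frames.append(part)
    long_df = pd.concat(frames, ignore_index=True)
    return long_df.sort_values(["claim_id", "evidence_rank"], ignore_index=True)


# Toy stand-in for the real table: two claims with two evidences each
toy = pd.DataFrame({
    "claim_id": [0, 5],
    "claim": ["claim A", "claim B"],
    "claim_label": ["SUPPORTS", "REFUTES"],
})
for i in range(2):
    toy[f"evidences/{i}/evidence_id"] = [f"Page A:{i}", f"Page B:{i}"]
    toy[f"evidences/{i}/evidence_label"] = ["SUPPORTS", "REFUTES"]
    toy[f"evidences/{i}/article"] = ["Page A", "Page B"]
    toy[f"evidences/{i}/evidence"] = [f"sentence {i} for A", f"sentence {i} for B"]
    toy[f"evidences/{i}/entropy"] = [0.0, 0.693147]

long_df = evidences_to_long(toy, n_evidences=2)
print(long_df.shape)  # (4, 9)
```

Applied to the full table with the default `n_evidences=5`, this would be expected to give 1535 × 5 = 7675 rows, matching the pair count in the dataset description.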

In [None]:
df.shape

(1535, 53)

In [None]:
df.isnull().sum()

claim_id                         0
claim                            0
claim_label                      0
evidences/0/evidence_id          0
evidences/0/evidence_label       0
evidences/0/article              0
evidences/0/evidence             0
evidences/0/entropy              0
evidences/0/votes/0            181
evidences/0/votes/1              0
evidences/0/votes/2            918
evidences/0/votes/3           1530
evidences/0/votes/4           1349
evidences/1/evidence_id          0
evidences/1/evidence_label       0
evidences/1/article              0
evidences/1/evidence             0
evidences/1/entropy              0
evidences/1/votes/0            181
evidences/1/votes/1              0
evidences/1/votes/2            918
evidences/1/votes/3           1530
evidences/1/votes/4           1349
evidences/2/evidence_id          0
evidences/2/evidence_label       0
evidences/2/article              0
evidences/2/evidence             0
evidences/2/entropy              0
evidences/2/votes/0 

In [None]:
df[['claim','claim_label']].head()

Unnamed: 0,claim,claim_label
0,Global warming is driving polar bears toward e...,SUPPORTS
1,The sun has gone into ‘lockdown’ which could c...,SUPPORTS
2,The polar bear population has been growing.,REFUTES
3,Ironic' study finds more CO2 has slightly cool...,REFUTES
4,Human additions of CO2 are in the margin of er...,REFUTES


In [None]:
df['claim'][0]

'Global warming is driving polar bears toward extinction'

In [None]:
df['claim_label'].value_counts()

SUPPORTS           654
NOT_ENOUGH_INFO    474
REFUTES            253
DISPUTED           154
Name: claim_label, dtype: int64

In [None]:
df['label'] = np.where(df['claim_label']=='SUPPORTS','SUPPORTS','NOT_SUPPORT')

In [None]:
df['label'].value_counts()

NOT_SUPPORT    881
SUPPORTS       654
Name: label, dtype: int64
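
Before training, it helps to know what a trivial classifier would score on this binary split. The sketch below uses only the counts printed above. The two baselines (always predict the majority class, always predict `SUPPORTS`) are illustrative reference points, not part of the original pipeline.

```python
# Class counts copied from the value_counts() output above
counts = {"NOT_SUPPORT": 881, "SUPPORTS": 654}
total = sum(counts.values())

# Baseline 1: always predict the majority class
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

# Baseline 2: always predict SUPPORTS (the positive class for binary F1).
# Precision equals the positive rate and recall is 1.
positive_rate = counts["SUPPORTS"] / total
always_positive_f1 = 2 * positive_rate / (positive_rate + 1)

print(majority_label, round(baseline_accuracy, 3), round(always_positive_f1, 3))
# NOT_SUPPORT 0.574 0.598
```

Any accuracy or F1 reported later is easier to judge against these two reference values.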

In [None]:
from sklearn import preprocessing

# .copy() makes df an independent DataFrame, so the assignment below
# does not trigger pandas' SettingWithCopyWarning
df = df[['claim', 'label']].copy()
le = preprocessing.LabelEncoder()
df['label'] = le.fit_transform(df['label'])
le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
print(le_name_mapping)


{'NOT_SUPPORT': 0, 'SUPPORTS': 1}


In [None]:
df['label'].value_counts()

0    881
1    654
Name: label, dtype: int64

### Training

In [None]:
model_nm = "bert-base-uncased"

tokz = AutoTokenizer.from_pretrained(model_nm)



model = AutoModelForSequenceClassification.from_pretrained(model_nm, num_labels=2, hidden_dropout_prob=0.2)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
df_train, df_temp = train_test_split(df, train_size=0.8, stratify = df['label'])
df_valid, df_test = train_test_split(df_temp, test_size=0.5,stratify = df_temp['label'])
df_train.shape, df_valid.shape, df_test.shape

((1228, 2), (153, 2), (154, 2))

In [None]:
adverse = DatasetDict({
    "train": Dataset.from_pandas(df_train),
    "valid": Dataset.from_pandas(df_valid),
    "test": Dataset.from_pandas(df_test)
    })

Tokenizing function

In [None]:
def tokenize(x):
    # Pad each batch to its longest claim and truncate anything beyond 512 tokens
    return tokz(x["claim"], truncation=True, padding=True, max_length=512)


adverse_encoded = adverse.map(tokenize, batched=True, batch_size=8)


Map:   0%|          | 0/1228 [00:00<?, ? examples/s]

Map:   0%|          | 0/153 [00:00<?, ? examples/s]

Map:   0%|          | 0/154 [00:00<?, ? examples/s]

In [None]:
batch_size = 16
model_name = f"{model_nm}-finetuned-climate-fever"
training_args = TrainingArguments(
    output_dir=model_name,
    num_train_epochs=6,
    learning_rate=4e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.03,
    evaluation_strategy="epoch",
    logging_strategy="epoch",
    disable_tqdm=False,
    log_level="error",
    fp16=True,                      # mixed precision; requires a CUDA GPU
    gradient_accumulation_steps=8,  # 8 batches per update (effective batch size 16 * 8 = 128)
    gradient_checkpointing=True,    # recompute activations in the backward pass to save memory
    report_to="wandb",
)
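
With gradient accumulation, the number of optimizer updates is much smaller than the number of batches. The arithmetic below is a sketch that assumes a single device and that the `Trainer` floors the number of updates per epoch (the behaviour of the library version used here, as far as the logged numbers suggest; other releases may handle the trailing partial accumulation differently).

```python
import math

n_train = 1228         # rows in the training split
per_device_batch = 16  # per_device_train_batch_size
grad_accum = 8         # gradient_accumulation_steps
epochs = 6             # num_train_epochs

batches_per_epoch = math.ceil(n_train / per_device_batch)        # 77 forward/backward passes
updates_per_epoch = max(batches_per_epoch // grad_accum, 1)      # 9 optimizer steps
total_updates = epochs * updates_per_epoch                       # 54 optimizer steps overall
effective_batch = per_device_batch * grad_accum                  # 128 examples per update
epochs_covered = total_updates * grad_accum / batches_per_epoch  # about 5.61 passes over the data

print(total_updates, effective_batch, round(epochs_covered, 2))  # 54 128 5.61
```

These values agree with the `train/global_step` of 54 and the final `epoch` of 5.61 reported by the run, which is why training stops slightly short of six full epochs.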

In [None]:
def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    preds = np.argmax(predictions, axis=1)
    f1 = f1_score(labels, preds)
    acc = accuracy_score(labels, preds)
    return {"accuracy": acc, "f1": f1}
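
To make the metric concrete: `f1_score` with its default `average="binary"` scores only label 1, which is `SUPPORTS` here. The NumPy-only sketch below reproduces the same two quantities by hand on made-up logits. The helper `binary_metrics` and the toy arrays are hypothetical.

```python
import numpy as np


def binary_metrics(logits, labels):
    """Accuracy and positive-class F1 computed from raw two-class logits."""
    preds = np.argmax(logits, axis=1)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = float(np.mean(preds == labels))
    return {"accuracy": accuracy, "f1": f1}


# Made-up logits for five claims; column 0 = NOT_SUPPORT, column 1 = SUPPORTS
toy_logits = np.array([[2.0, -1.0], [0.1, 0.4], [-0.5, 1.5], [1.0, 0.2], [0.3, 0.9]])
toy_labels = np.array([0, 1, 1, 1, 0])
toy_metrics = binary_metrics(toy_logits, toy_labels)
print(toy_metrics)
```

In this toy case three of five predictions are correct (accuracy 0.6), and with two true positives, one false positive, and one false negative, both precision and recall are 2/3, so F1 is 2/3 as well.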

In [None]:
trainer = Trainer(model=model, args=training_args,
                  compute_metrics=compute_metrics,
                  train_dataset=adverse_encoded["train"],
                  eval_dataset=adverse_encoded["valid"],
                  tokenizer=tokz)
trainer.train();  





Epoch,Training Loss,Validation Loss,Accuracy,F1
0,0.7351,0.657416,0.633987,0.540984
1,0.6296,0.635458,0.673203,0.632353
2,0.6644,0.61203,0.686275,0.641791
3,0.5558,0.58668,0.72549,0.686567
4,0.5405,0.588941,0.72549,0.7
5,0.5478,0.590524,0.718954,0.695035
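
The last epoch is not necessarily the best one. The sketch below picks the best epoch from the table above, by validation loss and by F1. The tuple layout is just a convenient copy of that table.

```python
# (epoch, training_loss, validation_loss, accuracy, f1), copied from the table above
history = [
    (0, 0.7351, 0.657416, 0.633987, 0.540984),
    (1, 0.6296, 0.635458, 0.673203, 0.632353),
    (2, 0.6644, 0.612030, 0.686275, 0.641791),
    (3, 0.5558, 0.586680, 0.725490, 0.686567),
    (4, 0.5405, 0.588941, 0.725490, 0.700000),
    (5, 0.5478, 0.590524, 0.718954, 0.695035),
]

best_by_loss = min(history, key=lambda row: row[2])
best_by_f1 = max(history, key=lambda row: row[4])

print("lowest validation loss at epoch", best_by_loss[0])  # 3
print("highest validation F1 at epoch", best_by_f1[0])     # 4
```

As configured, the `Trainer` ends with the weights from the final step. If the best checkpoint is wanted instead, `TrainingArguments` offers `load_best_model_at_end=True` together with `save_strategy="epoch"` and `metric_for_best_model`; whether that would help on this small validation set is an open question.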


In [None]:
trainer.evaluate()

{'eval_loss': 0.5905235409736633,
 'eval_accuracy': 0.7189542483660131,
 'eval_f1': 0.6950354609929077,
 'eval_runtime': 0.3391,
 'eval_samples_per_second': 451.178,
 'eval_steps_per_second': 58.978,
 'epoch': 5.61}

In [None]:


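# wandb.finish() is deferred until after the test-set evaluation below,
# so that the test metrics are logged to the same run.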



### Saving the Model and Loading it

In [None]:
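# Save the fine-tuned model (and its tokenizer) to Google Drive
trainer.save_model("/content/drive/MyDrive/fake_news_classifier/best_bert_climate_model")



# Save a second copy under the ml-service directory.
# Note: the directory is named "roberta-base", but the model saved here is bert-base-uncased.
save_dir = "ml-service/models/roberta-base"
tokz.save_pretrained(save_dir)
model.save_pretrained(save_dir)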

### Push the model to Hugging Face

In [None]:
from huggingface_hub import notebook_login
notebook_login()

In [None]:
trainer.push_to_hub()

#### Loading the model for inference

In [None]:
# Load the model
loaded_model = AutoModelForSequenceClassification.from_pretrained(
    "/content/drive/MyDrive/fake_news_classifier/best_bert_climate_model",
    num_labels=2
)

# Load the tokenizer
loaded_tokenizer = AutoTokenizer.from_pretrained(
    "/content/drive/MyDrive/fake_news_classifier/best_bert_climate_model",
)

# Max length
MAX_LENGTH = 512

In [None]:
example = "Trudeau's carbon tax will raise gas prices by 11 cents/litre."
example

"Trudeau's carbon tax will raise gas prices by 11 cents/litre."

In [None]:


# Our example text to pass to our fine-tuned model
class_mapping = {'NOT_SUPPORT': 0, 'SUPPORTS': 1}
text = example

def get_result(text, message=True):
    encoded_input = loaded_tokenizer(text, truncation=True, padding='max_length',
                                     max_length=MAX_LENGTH, return_tensors='pt')
    # No gradients are needed for inference
    with torch.no_grad():
        output = loaded_model(**encoded_input)
    result = output[0].numpy()
    # Per-logit sigmoid scores; unlike softmax, the two scores need not sum to 1
    probs = torch.sigmoid(output[0]).numpy()
    class_label = int(np.argmax(result))

    predicted_label = list(class_mapping.keys())[list(class_mapping.values()).index(class_label)]

    if not message:
        return predicted_label
    # Report the score of the predicted class rather than always class 0
    return f'The predicted class is label: {predicted_label} with a probability of {probs[0][class_label]}'



# Run your result through the function
prediction_result = get_result(text)
prediction_result
     


'The predicted class is label: NOT_SUPPORT with a probability of 0.6365536451339722'
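
A note on the reported probability: a two-label sequence classifier fine-tuned with integer labels is trained with cross-entropy over both logits, so softmax is the matching way to turn logits into class probabilities. The NumPy sketch below contrasts the two on made-up logits. The values are hypothetical and not taken from the model.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    # Subtract the row maximum for numerical stability
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for one claim: [NOT_SUPPORT, SUPPORTS]
logits = np.array([[0.56, -0.20]])

sig = sigmoid(logits)   # each logit squashed independently
soft = softmax(logits)  # normalised across the two classes

print("sigmoid:", sig, "sum =", sig.sum())
print("softmax:", soft, "sum =", soft.sum())
```

Both functions agree on which class wins, but only the softmax row sums to 1, so it is the safer number to show to a user as a probability.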

### Perform evaluation on the whole held-out test set

In [2]:
# Perform predictions on the test dataset
preds = trainer.predict(adverse_encoded["test"])
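
The object returned by `trainer.predict` exposes the raw logits as `predictions` and the true labels as `label_ids`. The sketch below shows one way to turn them into a confusion matrix. A named tuple with the same field names stands in for the real output so the example runs without a trained model, and `confusion_counts` is a hypothetical helper.

```python
from collections import namedtuple

import numpy as np

# Stand-in with the same field names as the object returned by trainer.predict
PredictionOutput = namedtuple("PredictionOutput", ["predictions", "label_ids", "metrics"])

fake_preds = PredictionOutput(
    predictions=np.array([[1.2, -0.3], [0.1, 0.8], [-0.4, 0.9], [0.2, 0.7], [0.5, 0.1]]),
    label_ids=np.array([0, 0, 1, 1, 0]),
    metrics={},
)


def confusion_counts(output, n_classes=2):
    """Rows are true labels, columns are predicted labels."""
    predicted = np.argmax(output.predictions, axis=1)
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for true_label, pred_label in zip(output.label_ids, predicted):
        matrix[true_label, pred_label] += 1
    return matrix


cm = confusion_counts(fake_preds)
print(cm)
# [[2 1]
#  [0 2]]
```

With the real output, `confusion_counts(preds)` would show whether errors are concentrated in one class (0 = NOT_SUPPORT, 1 = SUPPORTS), which a single accuracy figure hides.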



In [None]:
trainer.evaluate(adverse_encoded['test'])

{'eval_loss': 0.7349655628204346,
 'eval_accuracy': 0.5714285714285714,
 'eval_f1': 0.5352112676056339,
 'eval_runtime': 0.3448,
 'eval_samples_per_second': 446.635,
 'eval_steps_per_second': 58.005,
 'epoch': 5.61}
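
With only 154 test claims, the accuracy estimate is noisy. The sketch below puts a rough 95% interval around it using the normal approximation to the binomial. That approximation is an assumption made for illustration, not something the original analysis performed.

```python
import math

n_test = 154                        # size of the held-out test split
test_accuracy = 0.5714285714285714  # eval_accuracy reported above
valid_accuracy = 0.7189542483660131 # validation accuracy after the final epoch
majority_baseline = 881 / 1535      # share of NOT_SUPPORT in the full dataset

n_correct = round(test_accuracy * n_test)  # 88 correct predictions
std_err = math.sqrt(test_accuracy * (1 - test_accuracy) / n_test)
ci_low = test_accuracy - 1.96 * std_err
ci_high = test_accuracy + 1.96 * std_err

print(n_correct, round(std_err, 3), round(ci_low, 3), round(ci_high, 3))
print("baseline inside interval:", ci_low < majority_baseline < ci_high)
print("validation accuracy inside interval:", ci_low < valid_accuracy < ci_high)
```

Under this approximation the interval is roughly ±0.08 wide. It contains the majority-class baseline but not the validation accuracy, which suggests the validation score overstates how well the model generalises.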

In [None]:
wandb.finish()

Run history:
eval/accuracy,▄▆▆████▁
eval/f1,▁▅▆▇███▁
eval/loss,▄▃▂▁▁▁▁█
eval/runtime,▇▁█▁▆▁▁▁
eval/samples_per_second,▁█▁█▂██▇
eval/steps_per_second,▁█▁█▂██▇
train/epoch,▁▁▃▃▄▄▆▆▇▇█████
train/global_step,▁▁▃▃▄▄▆▆▇▇█████
train/learning_rate,█▆▅▃▂▁
train/loss,█▄▅▂▁▁

Run summary:
eval/accuracy,0.57143
eval/f1,0.53521
eval/loss,0.73497
eval/runtime,0.3448
eval/samples_per_second,446.635
eval/steps_per_second,58.005
train/epoch,5.61
train/global_step,54.0
train/learning_rate,0.0
train/loss,0.5478
