## Binary classification

| Experiment | Date | Details | Macro F1 | Comments |
| --- | --- | --- | --- | --- |
| MalayalamBERT finetune | 2024-12-23 | fine-tuned for two-class classification | .8196 |  |
| GPT-4o with a slightly better prompt | 2025-01-02 | wrote a prompt and gave 100 random training examples as context | .7823 |  |



<!-- | GPT4o mini prompt | 2025-01-02 | wrote a prompt and gave 100 random training examples as context | .6848 |  |
| GPT4o and slightly better prompt | 2025-01-02 | wrote a prompt and gave 100 random training examples as context | .7448 |  | -->
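
The GPT-4o rows describe giving 100 random training examples as context. A minimal sketch of how such a few-shot prompt could be assembled is shown here. The helper name `build_few_shot_prompt`, the instruction wording, and the toy examples are assumptions for illustration; the prompt actually used in the run is not recorded in this notebook.

```python
import random


def build_few_shot_prompt(examples, query_text, n_examples=100, seed=0):
    """Build a classification prompt from randomly sampled labelled examples.

    `examples` is a list of (text, label) pairs, e.g. the rows of Fake_train.csv.
    The instruction line is a placeholder, not the prompt used in the experiment.
    """
    rng = random.Random(seed)
    # Sample without replacement; cap at the number of available examples.
    shots = rng.sample(examples, min(n_examples, len(examples)))
    lines = ["Classify each news text as 'original' or 'Fake'.", ""]
    for text, label in shots:
        lines.append(f"Text: {text}")
        lines.append(f"Label: {label}")
        lines.append("")
    # The query goes last, with the label left open for the model to complete.
    lines.append(f"Text: {query_text}")
    lines.append("Label:")
    return "\n".join(lines)


toy_examples = [("sample text one", "original"), ("sample text two", "Fake")]
prompt = build_few_shot_prompt(toy_examples, "unseen text", n_examples=2)
print(prompt)
```

Fixing the seed keeps the sampled context identical across test items, which makes runs comparable.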

### Malayalam BERT finetune (run this section until the end to generate the submission)

In [5]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Load the pretrained Malayalam BERT model and tokenizer
model_name = "l3cube-pune/malayalam-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load datasets
train_data = pd.read_csv("Fake_train.csv")
valid_data = pd.read_csv("Fake_dev.csv")

# Preprocess datasets: Tokenize text and encode labels
def preprocess(data):
    tokenized = tokenizer(list(data["text"]), truncation=True, padding="max_length", max_length=512)
    tokenized["label"] = data["label"].map({"original": 0, "Fake": 1}).values
    return tokenized

train_dataset = Dataset.from_dict(preprocess(train_data))
valid_dataset = Dataset.from_dict(preprocess(valid_data))

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",  # same argument name as in the later cells; evaluation_strategy is the deprecated spelling
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    # load_best_model_at_end=True,
    save_strategy="no",
    save_total_limit=2,
    metric_for_best_model="macro_avg_f1"

)

# Define metrics for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    report = classification_report(labels, preds, target_names=["original", "Fake"], output_dict=True)
    return {
        "accuracy": report["accuracy"],
        "precision_original": report["original"]["precision"],
        "recall_original": report["original"]["recall"],
        "f1_original": report["original"]["f1-score"],
        "precision_Fake": report["Fake"]["precision"],
        "recall_Fake": report["Fake"]["recall"],
        "f1_Fake": report["Fake"]["f1-score"],
        "macro_avg_f1": 0.5 * (report["Fake"]["f1-score"] + report["original"]["f1-score"])
    }

# Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate on the train and validation datasets
train_results = trainer.evaluate(train_dataset)
valid_results = trainer.evaluate(valid_dataset)




| Epoch | Training Loss | Validation Loss | Accuracy | Precision Original | Recall Original | F1 Original | Precision Fake | Recall Fake | F1 Fake | Macro Avg F1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | No log | 0.690118 | 0.49816 | 0.5 | 0.00978 | 0.019185 | 0.498141 | 0.990148 | 0.662819 | 0.341002 |
| 2 | 0.688300 | 0.666589 | 0.741104 | 0.740291 | 0.745721 | 0.742996 | 0.741935 | 0.736453 | 0.739184 | 0.74109 |
| 3 | 0.688300 | 0.622977 | 0.777914 | 0.770142 | 0.794621 | 0.78219 | 0.78626 | 0.761084 | 0.773467 | 0.777828 |
| 4 | 0.637800 | 0.593057 | 0.802454 | 0.778027 | 0.848411 | 0.811696 | 0.831978 | 0.756158 | 0.792258 | 0.801977 |
| 5 | 0.637800 | 0.585669 | 0.802454 | 0.760504 | 0.885086 | 0.818079 | 0.861357 | 0.719212 | 0.783893 | 0.800986 |






In [19]:
test_data = pd.read_csv("fake_test_binary_with_labels.csv")
# test_data["label"] = ["Fake"] * 510 + ["original"] * 509
test_dataset = Dataset.from_dict(preprocess(test_data))

In [20]:
test_preds = np.argmax(trainer.predict(test_dataset).predictions, axis=1)



In [21]:
from sklearn.metrics import classification_report
print(classification_report(y_pred=test_preds, y_true=1 * (test_data["label"] == "Fake"), digits=4))

              precision    recall  f1-score   support

           0     0.7709    0.8809    0.8222       512
           1     0.8594    0.7357    0.7928       507

    accuracy                         0.8086      1019
   macro avg     0.8152    0.8083    0.8075      1019
weighted avg     0.8150    0.8086    0.8076      1019
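
As a quick check of how the macro-average row follows from the per-class rows, the snippet below recomputes each class F1 from its precision and recall and then averages them without weighting, which is the same quantity `compute_metrics` returns as `macro_avg_f1`. The numbers are copied from the report above; the variable names are only illustrative.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Per-class precision and recall copied from the classification report.
f1_original = f1(0.7709, 0.8809)  # class 0
f1_fake = f1(0.8594, 0.7357)      # class 1

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = (f1_original + f1_fake) / 2
print(round(f1_original, 4), round(f1_fake, 4), round(macro_f1, 4))
```

Up to rounding of the inputs, the three values agree with the `f1-score` column of the report.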



In [26]:
labels = ["original", "Fake"]
test_data["prediction"] = [labels[int(x)] for x in test_preds]

In [29]:
test_data.drop(columns=["label"]).to_csv("lowes_task1_run.tsv",sep="\t", index=False)

In [None]:
# # Use the custom Trainer
# trainer = CustomTrainer(
#     model=model,
#     args=training_args,
#     train_dataset=train_dataset,
#     eval_dataset=valid_dataset,
#     tokenizer=tokenizer,
#     compute_metrics=compute_metrics,
# )

## Fake news (multiple classes)

| Experiment | Date | Details | Macro F1 | Comments |
| --- | --- | --- | --- | --- |
| MalayalamBERT finetune | 2024-12-23 | fine-tuned for multi-class classification | .2153 |  |
| + focal loss | | focal loss (gamma = 0.1) with inverse class-frequency weights raised to the power 0.1 | .2153 |  |
| GPT-4o | |  | .2156 |  |


### Malayalam BERT finetune

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Load datasets
multi_class_data = pd.read_csv("fake_news_classification_mal_train.csv")
multi_class_data["Label"] = multi_class_data["Label"].apply(lambda x: x.strip())
class_names = sorted(list(multi_class_data["Label"].unique()))
train_data, valid_data = train_test_split(multi_class_data, test_size=0.3, random_state=42)

test_data = pd.read_csv("fake_test_multiclass_labeled.csv")
test_data["Label"] = test_data["Label"].apply(lambda x: x.strip())
# Evaluate on the labelled test set; this overrides the 30% hold-out split created above
valid_data = test_data

# Load the pretrained Malayalam BERT model and tokenizer
model_name = "l3cube-pune/malayalam-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(class_names))

# Preprocess datasets: Tokenize text and encode labels
def preprocess(data):
    tokenized = tokenizer(list(data["News"]), truncation=True, padding="max_length", max_length=256)
    tokenized["label"] = data["Label"].map({class_name: i for i, class_name in enumerate(class_names)}).values
    return tokenized

train_dataset = Dataset.from_dict(preprocess(train_data))
valid_dataset = Dataset.from_dict(preprocess(valid_data))

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-6,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    # load_best_model_at_end=True,
    save_strategy="no",
    save_total_limit=2,
    metric_for_best_model="macro_avg_f1"

)

# Define metrics for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    report = classification_report(labels, preds, target_names=class_names, output_dict=True)
    return {
        "macro_avg_f1":np.mean([report[class_name]["f1-score"] for class_name in class_names])
    }

# Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate on the train and validation datasets
train_results = trainer.evaluate(train_dataset)
valid_results = trainer.evaluate(valid_dataset)




| Epoch | Training Loss | Validation Loss | Macro Avg F1 |
| --- | --- | --- | --- |
| 1 | No log | 1.387423 | 0.031564 |
| 2 | No log | 1.386181 | 0.163474 |
| 3 | No log | 1.385138 | 0.165552 |
| 4 | No log | 1.384433 | 0.166667 |
| 5 | No log | 1.384158 | 0.166667 |




In [2]:
train_results

{'eval_loss': 1.3812459707260132,
 'eval_macro_avg_f1': 0.2089715536105033,
 'eval_runtime': 6.8906,
 'eval_samples_per_second': 193.018,
 'eval_steps_per_second': 0.871,
 'epoch': 5.0}

In [3]:
valid_results

{'eval_loss': 1.3841580152511597,
 'eval_macro_avg_f1': 0.16666666666666666,
 'eval_runtime': 1.094,
 'eval_samples_per_second': 182.818,
 'eval_steps_per_second': 0.914,
 'epoch': 5.0}
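
The validation macro F1 of exactly 0.1667 and the loss of about 1.384 are what one would expect if the model predicts the majority class for every example with near-uniform probabilities. The sketch below checks this arithmetic using the test-set label counts reported at the end of this notebook (this run evaluates on the test set). It is consistent with a collapsed model but does not prove it; the function name is hypothetical.

```python
import math

# Label counts of fake_test_multiclass_labeled.csv, taken from its value_counts() output.
support = {"FALSE": 100, "MOSTLY FALSE": 56, "HALF TRUE": 37, "PARTLY FALSE": 7}


def macro_f1_constant_prediction(support, predicted):
    """Macro F1 when every example is assigned the single class `predicted`."""
    total = sum(support.values())
    scores = []
    for name, n in support.items():
        if name == predicted:
            precision = n / total  # only the true members of this class are correct
            recall = 1.0           # every true member is retrieved
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)     # never predicted, so F1 is 0
    return sum(scores) / len(scores)


print(macro_f1_constant_prediction(support, "FALSE"))
# Cross-entropy of a uniform distribution over four classes, for comparison with eval_loss.
print(math.log(len(support)))
```

Always predicting `FALSE` gives an F1 of 2/3 for that class and 0 for the other three, so the macro average is 1/6; ln 4 is about 1.386, close to the reported `eval_loss`.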

In [None]:
import torch
torch.cuda.empty_cache()

## Finetune with synthetic data

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Load datasets
multi_class_data = pd.concat([
    pd.read_csv("fake_news_classification_mal_train.csv"), 
    pd.read_csv("fake_news_classification_mal_train_synthetic_data.csv")
], axis=0)
multi_class_data.drop_duplicates(subset=["News"], inplace=True)
multi_class_data["Label"] = multi_class_data["Label"].apply(lambda x: x.strip())
class_names = sorted(list(multi_class_data["Label"].unique()))
train_data, valid_data = train_test_split(multi_class_data, test_size=0.3, random_state=42)

# Load the pretrained Malayalam BERT model and tokenizer
model_name = "l3cube-pune/malayalam-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(class_names))

# Preprocess datasets: Tokenize text and encode labels
def preprocess(data):
    tokenized = tokenizer(list(data["News"]), truncation=True, padding="max_length", max_length=256)
    tokenized["label"] = data["Label"].map({class_name: i for i, class_name in enumerate(class_names)}).values
    return tokenized

train_dataset = Dataset.from_dict(preprocess(train_data))
valid_dataset = Dataset.from_dict(preprocess(valid_data))

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    # load_best_model_at_end=True,
    save_strategy="no",
    save_total_limit=2,
    metric_for_best_model="macro_avg_f1"

)
pred_names = []
# Define metrics for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    pred_names = [class_names[i] for i in preds]
    report = classification_report(labels, preds, target_names=class_names, output_dict=True)
    return {
        "macro_avg_f1":np.mean([report[class_name]["f1-score"] for class_name in class_names])
    }

# Trainer object
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate on the train and validation datasets
train_results = trainer.evaluate(train_dataset)
valid_results = trainer.evaluate(valid_dataset)
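# compute_metrics only assigns a local pred_names, so the module-level list stays empty;
# recompute the validation predictions explicitly before saving them
valid_preds = np.argmax(trainer.predict(valid_dataset).predictions, axis=-1)
valid_data = valid_data.copy()
valid_data["pred"] = [class_names[i] for i in valid_preds]
valid_data.to_csv("./validation_preds.csv")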

In [12]:
valid_results

{'eval_loss': 1.3792507648468018,
 'eval_macro_avg_f1': 0.18357487922705315,
 'eval_runtime': 3.5984,
 'eval_samples_per_second': 218.428,
 'eval_steps_per_second': 1.112,
 'epoch': 5.0}

In [10]:
multi_class_data.shape

(2372, 3)

### Malayalam BERT finetune with focal loss

In [8]:
import pandas as pd
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
import torch
from torch.nn import functional as F

# Load datasets
multi_class_data = pd.read_csv("fake_news_classification_mal_train.csv")
multi_class_data["Label"] = multi_class_data["Label"].apply(lambda x: x.strip())
class_names = sorted(list(multi_class_data["Label"].unique()))
train_data, valid_data = train_test_split(multi_class_data, test_size=0.3, random_state=42)
class_weights = [1 / (train_data["Label"].value_counts()[class_name]) ** 0.1 for class_name in class_names]

# Load the pretrained Malayalam BERT model and tokenizer
model_name = "l3cube-pune/malayalam-bert"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(class_names))

# Preprocess datasets: Tokenize text and encode labels
def preprocess(data):
    tokenized = tokenizer(list(data["News"]), truncation=True, padding="max_length", max_length=256)
    tokenized["label"] = data["Label"].map({class_name: i for i, class_name in enumerate(class_names)}).values
    return tokenized

train_dataset = Dataset.from_dict(preprocess(train_data))
valid_dataset = Dataset.from_dict(preprocess(valid_data))

# Focal Loss Implementation
class FocalLoss(torch.nn.Module):
    def __init__(self, gamma=2, alpha=None, reduction="mean"):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.alpha = alpha  # Can be used to give weights to classes
        self.reduction = reduction

    def forward(self, logits, targets):
        probs = F.softmax(logits, dim=-1)  # Convert logits to probabilities
        targets_one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
        pt = (probs * targets_one_hot).sum(dim=-1)  # Probability of the true class
        log_pt = torch.log(pt + 1e-12)  # Add small value to avoid log(0)
        focal_loss = -(1 - pt) ** self.gamma * log_pt  # Apply the focal loss formula

        # Apply class weights if provided
        if self.alpha is not None:
            alpha_t = self.alpha.gather(0, targets)
            focal_loss = focal_loss * alpha_t

        # Reduction
        if self.reduction == "mean":
            return focal_loss.mean()
        elif self.reduction == "sum":
            return focal_loss.sum()
        else:
            return focal_loss

# Define custom Trainer with Focal Loss
class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, num_items_in_batch=None):
        # print(f"Unexpected kwargs: {kwargs}")  # Log unexpected arguments
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        logits = outputs.logits
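        # Build the class weights on the same device and dtype as the logits instead of hard-coding "cuda"
        loss_fn = FocalLoss(gamma=0.1, alpha=torch.tensor(class_weights, dtype=logits.dtype, device=logits.device))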
        loss = loss_fn(logits, labels)
        return (loss, outputs) if return_outputs else loss

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=5e-6,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=50,
    save_strategy="no",
    save_total_limit=2,
    metric_for_best_model="macro_avg_f1"
)

# Define metrics for evaluation
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    report = classification_report(labels, preds, target_names=class_names, output_dict=True)
    return {
        "macro_avg_f1": np.mean([report[class_name]["f1-score"] for class_name in class_names])
    }

# Use the custom Trainer
trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

# Evaluate on the train and validation datasets
train_results = trainer.evaluate(train_dataset)
valid_results = trainer.evaluate(valid_dataset)



| Epoch | Training Loss | Validation Loss | Macro Avg F1 |
| --- | --- | --- | --- |
| 1 | No log | 0.711576 | 0.189386 |
| 2 | No log | 0.71039 | 0.215285 |
| 3 | No log | 0.709303 | 0.215285 |
| 4 | No log | 0.708577 | 0.215285 |
| 5 | No log | 0.708304 | 0.215285 |
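
One possible reason the class weights change so little is the small exponent in `1 / count ** 0.1`. The sketch below compares the weight ratio between the rarest and the most frequent class for a few exponents. The counts are those of the full training file after merging the two `FALSE` variants (1220 + 166), which is an assumption made for illustration; the notebook computes its weights on the 70% train split, so the exact values differ.

```python
# Full-file label counts after stripping whitespace (1220 + 166 = 1386 for FALSE).
# Assumption: the notebook's 70% train split has roughly the same proportions.
counts = {"FALSE": 1386, "HALF TRUE": 162, "MOSTLY FALSE": 295, "PARTLY FALSE": 57}


def inverse_frequency_weights(counts, power):
    """Weight each class by 1 / count ** power, as in the focal-loss cell."""
    return {name: 1 / n ** power for name, n in counts.items()}


ratios = {}
for power in (0.1, 0.5, 1.0):
    weights = inverse_frequency_weights(counts, power)
    # How much more a PARTLY FALSE example counts than a FALSE example.
    ratios[power] = weights["PARTLY FALSE"] / weights["FALSE"]
    print(f"power={power}: rarest / most frequent weight ratio = {ratios[power]:.2f}")
```

With a power of 0.1 the rarest class is weighted only about 1.4 times more than the majority class, compared with about 24 times for plain inverse frequency, so the loss stays close to unweighted cross-entropy.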




## fastText model training

In [18]:
def csv_to_fasttext(input_df, output_txt, text_col="News", label_col="Label"):
    with open(output_txt, 'w', encoding="utf-8") as f:
        for _, row in input_df.iterrows():
            # fastText splits on whitespace, so labels such as "MOSTLY FALSE" need underscores
            label = "__label__" + "_".join(row[label_col].split())
            text = row[text_col].replace("\n", " ")  # Remove newline characters
            f.write(f"{label} {text}\n")
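
For reference, fastText's supervised format is one example per line: one or more tokens starting with `__label__`, followed by the text. Because tokens are split on whitespace, a label containing a space would be cut after its first word. The hypothetical helper below (not part of the notebook) shows one way to produce a safe line.

```python
def to_fasttext_line(label, text):
    """Format one example for fastText supervised training (illustrative helper)."""
    # Join the words of a multi-word label with underscores so it stays a single token.
    safe_label = "_".join(label.split())
    # Collapse all whitespace, including newlines, so the example stays on one line.
    clean_text = " ".join(text.split())
    return f"__label__{safe_label} {clean_text}"


line = to_fasttext_line("MOSTLY FALSE", "first line\nsecond line")
print(line)  # __label__MOSTLY_FALSE first line second line
```

Predicted labels then come back as `__label__MOSTLY_FALSE`, so the underscore has to be mapped back to a space before comparing with the original labels.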

In [19]:
import fasttext

# Paths to dataset files
train_file = "train.txt"
validation_file = "validation.txt"

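csv_to_fasttext(train_data, train_file)
# The split is named valid_data; the recorded run used the undefined name
# validation_data and stopped here with the NameError shown in the output.
csv_to_fasttext(valid_data, validation_file)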

# Train the FastText model
model = fasttext.train_supervised(
    input=train_file,
    lr=1.0,  # Learning rate
    epoch=25,  # Number of epochs
    wordNgrams=2,  # Use word n-grams
    verbose=2,  # Verbosity level
    loss="softmax"  # Loss function
)

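# Evaluate on the validation set
# fasttext's model.test returns a tuple: (number of examples, precision@1, recall@1)
n_valid, valid_precision, valid_recall = model.test(validation_file)
print("\nValidation Results:")
print(f"Precision: {valid_precision:.4f}")
print(f"Recall: {valid_recall:.4f}")
print(f"Number of samples: {n_valid}")

# Evaluate on the test set (test_file was never defined in the original cell)
test_file = "test.txt"
test_df = pd.read_csv("fake_test_multiclass_labeled.csv")
test_df["Label"] = test_df["Label"].str.strip()
csv_to_fasttext(test_df, test_file)
n_test, test_precision, test_recall = model.test(test_file)
print("\nTest Results:")
print(f"Precision: {test_precision:.4f}")
print(f"Recall: {test_recall:.4f}")
print(f"Number of samples: {n_test}")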

# Save the model
model.save_model("fasttext_model.bin")

# Example of predicting labels for new texts
texts = ["This is an example text.", "Another example sentence."]
predictions = [model.predict(text) for text in texts]
for text, prediction in zip(texts, predictions):
    print(f"Text: {text}")
    print(f"Prediction: {prediction}")


NameError: name 'validation_data' is not defined

In [31]:
pd.read_excel("fake_news_classification_mal_test.xlsx")

ImportError: Missing optional dependency 'openpyxl'.  Use pip or conda to install openpyxl.

## Paper Writing

In [3]:
import pandas as pd
train_data = pd.read_csv("Fake_train.csv")
valid_data = pd.read_csv("Fake_dev.csv")
test_data = pd.read_csv("fake_test_binary_with_labels.csv")

In [4]:
train_data["label"].value_counts(), valid_data["label"].value_counts(), test_data["label"].value_counts()

(label
 original    1658
 Fake        1599
 Name: count, dtype: int64,
 label
 original    409
 Fake        406
 Name: count, dtype: int64,
 label
 original    512
 Fake        507
 Name: count, dtype: int64)

In [11]:
# Simple tfidf + lr

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Load the datasets
train_df = pd.read_csv('Fake_train.csv')
test_df = pd.read_csv('fake_test_binary_with_labels.csv')

# Prepare the data
X_train = train_df['text']
y_train = train_df['label']
X_test = test_df['text']
y_test = test_df['label']

# Optionally, split train set further to validate performance on a smaller train split
# X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Create the TF-IDF + Logistic Regression pipeline
pipeline = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(max_iter=1000)  # Adjust max_iter if you run into convergence issues
)

# Train the model
pipeline.fit(X_train, y_train)

# Predict on train and test data
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

# Evaluate using macro F1 score
train_f1 = f1_score(y_train, y_train_pred, average='macro')
test_f1 = f1_score(y_test, y_test_pred, average='macro')

print(f"Train Macro F1 Score: {train_f1:.4f}")
print(f"Test Macro F1 Score: {test_f1:.4f}")

Train Macro F1 Score: 0.9278
Test Macro F1 Score: 0.7690


In [13]:
import pandas as pd
import fasttext
from sklearn.metrics import f1_score

# Load datasets
train_df = pd.read_csv('Fake_train.csv')
test_df = pd.read_csv('fake_test_binary_with_labels.csv')

# Prepare the data (fastText expects labels to start with '__label__' prefix)
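def preprocess_for_fasttext(df, filename):
    with open(filename, 'w', encoding='utf-8') as f:
        for _, row in df.iterrows():
            # fastText expects '__label__<label>' followed by the text, one example per line,
            # so newlines inside the text are replaced with spaces
            text = row['text'].replace("\n", " ")
            f.write(f"__label__{row['label']} {text}\n")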

# Preprocess and save data in fastText's format
preprocess_for_fasttext(train_df, 'train.ft')
preprocess_for_fasttext(test_df, 'test.ft')

# Train a fastText classifier
model = fasttext.train_supervised(input='train.ft', epoch=25, lr=0.1, wordNgrams=2)

# Make predictions on train and test data
# NOTE: on NumPy 2.x, fasttext's model.predict can raise the ValueError recorded in the
# output (it calls np.array(..., copy=False)); running under NumPy 1.x is one way to avoid it.
def predict_fasttext(model, df):
    predictions = []
    for text in df['text']:
        # predict handles one line at a time, so newlines are removed first
        labels, _ = model.predict(text.replace("\n", " "))  # returns (labels, probabilities)
        predictions.append(labels[0].replace('__label__', ''))  # top label without the prefix
    return predictions

y_train_pred = predict_fasttext(model, train_df)
y_test_pred = predict_fasttext(model, test_df)

# Calculate macro F1 score
train_f1 = f1_score(train_df['label'], y_train_pred, average='macro')
test_f1 = f1_score(test_df['label'], y_test_pred, average='macro')

# Print the results
print(f"Train Macro F1 Score: {train_f1:.4f}")
print(f"Test Macro F1 Score: {test_f1:.4f}")


Read 0M words
Number of words:  19466
Number of labels: 2
Progress: 100.0% words/sec/thread:   21976 lr:  0.000000 avg.loss:  0.242259 ETA:   0h 0m 0s


ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.

In [23]:
pd.read_csv("fake_news_classification_mal_train.csv")["Label"].value_counts()

Label
FALSE           1220
MOSTLY FALSE     295
FALSE            166
HALF TRUE        162
PARTLY FALSE      57
Name: count, dtype: int64
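
`FALSE` appears twice in this output, which suggests that the two variants differ only in surrounding whitespace; this is presumably why the training cells call `.strip()` on the labels. A small sketch with toy labels (the trailing space is an assumption about how the variants differ) shows the effect:

```python
from collections import Counter

# Toy labels; the second variant carries a trailing space (assumed form of the duplicate).
raw_labels = ["FALSE"] * 3 + ["FALSE "] * 2 + ["HALF TRUE"]

raw_counts = Counter(raw_labels)
clean_counts = Counter(label.strip() for label in raw_labels)

print(raw_counts)    # "FALSE" and "FALSE " are counted separately
print(clean_counts)  # after strip() they merge into one class
```

Applied to the counts above, stripping would merge 1220 and 166 into a single `FALSE` class of 1386 examples, leaving four classes.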

In [24]:
pd.read_csv("fake_test_multiclass_labeled.csv")["Label"].value_counts()

Label
FALSE           100
MOSTLY FALSE     56
HALF TRUE        37
PARTLY FALSE      7
Name: count, dtype: int64