
# Task
Write a PyTorch model for fake news prediction using the CSV files provided. Preprocess the data, use DataLoaders, and use an efficient model to reduce the loss.

Here is all the data you need:
"True.csv"
"Fake.csv"

## Data loading

### Subtask:
Load the "True.csv" and "Fake.csv" files into separate pandas DataFrames.


**Reasoning**:
Load the "True.csv" and "Fake.csv" files into pandas DataFrames.



In [1]:
import pandas as pd

df_true = pd.read_csv('True.csv')
df_fake = pd.read_csv('Fake.csv')

## Data preparation

### Subtask:
Create PyTorch Datasets and DataLoaders for training, validation, and testing sets.


**Reasoning**:
Create a custom PyTorch Dataset class, instantiate three dataset objects for training, validation, and testing sets, and create PyTorch DataLoaders for each dataset.



In [2]:
import torch
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split

# Concatenate the true and fake dataframes
df_true['label'] = 1
df_fake['label'] = 0
df = pd.concat([df_true, df_fake], ignore_index=True)

# Split the data into training, validation, and testing sets
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=42)  # 0.25 x 0.8 = 0.2

# Define the custom PyTorch dataset class
class NewsDataset(Dataset):
    def __init__(self, texts, labels):
        self.texts = texts
        self.labels = labels

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts.iloc[idx]
        label = self.labels.iloc[idx]
        return text, label

# Create dataset objects for training, validation, and testing
train_dataset = NewsDataset(X_train, y_train)
val_dataset = NewsDataset(X_val, y_val)
test_dataset = NewsDataset(X_test, y_test)

# Create DataLoaders for each dataset
batch_size = 32
train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size)
test_dataloader = DataLoader(test_dataset, batch_size=batch_size)

# Print some examples from each dataset and DataLoader
print(f"Train Dataset Example: {train_dataset[0]}")
print(f"Val Dataset Example: {val_dataset[0]}")
print(f"Test Dataset Example: {test_dataset[0]}")

print("Training DataLoader Example:")
for batch_idx, (texts, labels) in enumerate(train_dataloader):
    print(f"Batch {batch_idx}: texts in batch: {len(texts)}, labels in batch: {len(labels)}")
    break

Train Dataset Example: ('BRASILIA (Reuters) - Brazil s top prosecutor on Friday charged six lawmakers from President Michel Temer s Brazilian Democracy Movement Party (PMDB) with forming a criminal organization, the latest in a barrage of charges in the country s sprawling corruption scandal. Those accused by prosecutor Rodrigo Janot in a filing with the Supreme Court include former senator and president Jose Sarney, the government s leader in the Senate Romero Juca and four other current senators. A corruption scandal involving cartels of companies bribing officials for public contracts has enveloped most of Brazil s political elite with Janot expected to issue another charge against Temer in coming weeks.  Temer defeated a first corruption charge from Janot, when the lower house of Congress voted not to allow it to proceed to trial. Separately on Friday, the Supreme Court said it would consider next week requests from Temer that would block Janot from issuing further charges. In the 
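
The two-stage split above is meant to produce a 60/20/20 train/validation/test partition. A minimal sketch on a hypothetical 100-row toy frame can confirm the arithmetic. The `stratify` argument is an addition that the original cell does not use; it keeps the class balance identical across the three splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame standing in for the combined True/Fake data
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(100)],
    "label": [i % 2 for i in range(100)],
})

# First split: hold out 20% of all rows as the test set
X_tmp, X_te, y_tmp, y_te = train_test_split(
    toy["text"], toy["label"], test_size=0.2, random_state=42, stratify=toy["label"]
)
# Second split: 25% of the remaining 80% becomes validation (0.25 * 0.8 = 0.2)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp
)

split_sizes = (len(X_tr), len(X_va), len(X_te))
print(split_sizes)
```

Without `stratify`, the split sizes are the same but the share of fake and true articles can drift slightly between splits.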

## Feature engineering

### Subtask:
Convert the text data into numerical features using TF-IDF.


**Reasoning**:
Import TfidfVectorizer and convert the text data into numerical features using TF-IDF.



In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
import torch

# Create a TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# Fit the vectorizer on the training data
vectorizer.fit(X_train)

# Transform the training, validation, and test data into TF-IDF feature vectors
X_train_tfidf = vectorizer.transform(X_train)
X_val_tfidf = vectorizer.transform(X_val)
X_test_tfidf = vectorizer.transform(X_test)

# Convert the TF-IDF feature vectors into PyTorch tensors
X_train_tfidf = torch.from_numpy(X_train_tfidf.toarray()).float()
X_val_tfidf = torch.from_numpy(X_val_tfidf.toarray()).float()
X_test_tfidf = torch.from_numpy(X_test_tfidf.toarray()).float()
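
To illustrate why the vectorizer is fitted on the training split only, here is a small sketch on a hypothetical toy corpus (the documents and the names `toy_vectorizer`, `train_docs`, and `new_docs` are made up for illustration). Words that never appeared during fitting are simply ignored at transform time, so a document made entirely of unseen words becomes an all-zero vector.

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; the real notebook uses X_train, X_val, X_test
train_docs = [
    "senate passes budget bill",
    "shocking secret they hide",
    "budget talks continue in senate",
]
new_docs = ["senate budget vote delayed", "zebra xylophone"]

toy_vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
toy_vectorizer.fit(train_docs)  # vocabulary and IDF weights come from training text only

new_matrix = toy_vectorizer.transform(new_docs)  # SciPy sparse matrix
new_tensor = torch.from_numpy(new_matrix.toarray()).float()  # dense float32 tensor

vocab_size = len(toy_vectorizer.vocabulary_)
print(new_tensor.shape)            # (number of documents, vocabulary size)
print(new_tensor[1].sum().item())  # second document has no known words
```

Calling `.toarray()` materializes the full dense matrix, which is fine for 5000 features but can use a lot of memory for larger vocabularies.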

## Model training

### Subtask:
Train a feedforward neural network model for fake news prediction.


**Reasoning**:
Define and train a feedforward neural network model using the prepared data.



In [4]:
import torch
import torch.nn as nn
import torch.optim as optim

# Define the neural network model
class FFNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(FFNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, output_size)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        x = self.sigmoid(x)
        return x

# Define model hyperparameters
input_size = X_train_tfidf.shape[1]
hidden_size = 128
output_size = 1
learning_rate = 0.001
num_epochs = 10

# Create the model, loss function, and optimizer
model = FFNN(input_size, hidden_size, output_size)
criterion = nn.BCELoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training loop
for epoch in range(num_epochs):
    model.train()
    for batch_idx, (texts, labels) in enumerate(train_dataloader):
        # Reshape labels to (batch, 1) floats to match the model output
        labels = labels.float().unsqueeze(1)
        # Vectorize the raw text batch with the fitted TF-IDF vectorizer
        texts_tfidf = vectorizer.transform(texts)
        texts_tfidf = torch.from_numpy(texts_tfidf.toarray()).float()

        # Forward pass
        outputs = model(texts_tfidf)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Print training progress
        if (batch_idx + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{batch_idx + 1}/{len(train_dataloader)}], Loss: {loss.item():.4f}')

    # Evaluation on the validation set
    model.eval()
    with torch.no_grad():
        correct = 0
        total = 0
        for texts, labels in val_dataloader:
            # Reshape labels to (batch, 1) floats to match the model output
            labels = labels.float().unsqueeze(1)
            # Vectorize the raw text batch with the fitted TF-IDF vectorizer
            texts_tfidf = vectorizer.transform(texts)
            texts_tfidf = torch.from_numpy(texts_tfidf.toarray()).float()

            outputs = model(texts_tfidf)
            predicted = (outputs > 0.5).float()
            total += labels.size(0)
            correct += (predicted == labels).sum().item()

        accuracy = 100 * correct / total
        print(f'Epoch [{epoch + 1}/{num_epochs}], Validation Accuracy: {accuracy:.2f}%')

Epoch [1/10], Step [100/842], Loss: 0.2280
Epoch [1/10], Step [200/842], Loss: 0.1219
Epoch [1/10], Step [300/842], Loss: 0.0363
Epoch [1/10], Step [400/842], Loss: 0.0727
Epoch [1/10], Step [500/842], Loss: 0.1408
Epoch [1/10], Step [600/842], Loss: 0.0538
Epoch [1/10], Step [700/842], Loss: 0.0310
Epoch [1/10], Step [800/842], Loss: 0.0987
Epoch [1/10], Validation Accuracy: 98.65%
Epoch [2/10], Step [100/842], Loss: 0.0183
Epoch [2/10], Step [200/842], Loss: 0.0287
Epoch [2/10], Step [300/842], Loss: 0.0096
Epoch [2/10], Step [400/842], Loss: 0.0087
Epoch [2/10], Step [500/842], Loss: 0.0046
Epoch [2/10], Step [600/842], Loss: 0.0088
Epoch [2/10], Step [700/842], Loss: 0.0180
Epoch [2/10], Step [800/842], Loss: 0.0120
Epoch [2/10], Validation Accuracy: 98.81%
Epoch [3/10], Step [100/842], Loss: 0.0103
Epoch [3/10], Step [200/842], Loss: 0.0086
Epoch [3/10], Step [300/842], Loss: 0.0032
Epoch [3/10], Step [400/842], Loss: 0.0131
Epoch [3/10], Step [500/842], Loss: 0.0005
Epoch [3/10],
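
The training loop above calls `vectorizer.transform` on every batch in every epoch, even though the TF-IDF tensors (`X_train_tfidf` and so on) were already computed once. One possible alternative, sketched below under stated assumptions, is to wrap precomputed feature tensors in a `TensorDataset` so each epoch only slices tensors. The random `features` and `targets` are hypothetical stand-ins for the real TF-IDF data, and the sketch swaps in `nn.BCEWithLogitsLoss` on raw logits in place of the notebook's `Sigmoid` plus `BCELoss` pair, which is generally more numerically stable.

```python
import math

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-ins for precomputed TF-IDF features and 0/1 labels
features = torch.rand(64, 20)
targets = (features[:, 0] > 0.5).float().unsqueeze(1)  # shape (64, 1)

# Features are computed once; the loader only batches and shuffles tensors
loader = DataLoader(TensorDataset(features, targets), batch_size=16, shuffle=True)

# No final Sigmoid: BCEWithLogitsLoss applies it internally
toy_model = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.Adam(toy_model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(5):
    toy_model.train()
    running = 0.0
    for xb, yb in loader:
        opt.zero_grad()
        loss = loss_fn(toy_model(xb), yb)
        loss.backward()
        opt.step()
        running += loss.item() * xb.size(0)
    # Average loss over the whole epoch rather than a single batch
    epoch_losses.append(running / len(loader.dataset))

print(epoch_losses)
```

With logits as the model output, predictions at inference time would be taken as `torch.sigmoid(logits) > 0.5`.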

## Model evaluation

### Subtask:
Evaluate the trained model on the test dataset.


**Reasoning**:
Evaluate the trained model on the test dataset using the test_dataloader and calculate accuracy, precision, recall, and F1-score.



In [5]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

model.eval()
y_true = []
y_pred = []

with torch.no_grad():
    for texts, labels in test_dataloader:
        # Reshape labels to (batch, 1) floats to match the model output
        labels = labels.float().unsqueeze(1)
        # Vectorize the raw text batch with the fitted TF-IDF vectorizer
        texts_tfidf = vectorizer.transform(texts)
        texts_tfidf = torch.from_numpy(texts_tfidf.toarray()).float()

        outputs = model(texts_tfidf)
        predicted = (outputs > 0.5).float()

        y_true.extend(labels.cpu().numpy().tolist())
        y_pred.extend(predicted.cpu().numpy().tolist())

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Test Accuracy: {accuracy:.4f}")
print(f"Test Precision: {precision:.4f}")
print(f"Test Recall: {recall:.4f}")
print(f"Test F1-Score: {f1:.4f}")

Test Accuracy: 0.9898
Test Precision: 0.9882
Test Recall: 0.9905
Test F1-Score: 0.9894
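
To make the four metrics concrete, the following sketch recomputes them by hand from confusion-matrix counts on a hypothetical set of ten labels (`toy_true` and `toy_pred` are made up). As in the notebook, label 1 is true news, so precision and recall here describe the "true news" class.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = true news, 0 = fake news
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_true, toy_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
manual_precision = tp / (tp + fp)  # of articles predicted true, how many are true
manual_recall = tp / (tp + fn)     # of truly true articles, how many were found
manual_f1 = 2 * manual_precision * manual_recall / (manual_precision + manual_recall)

print(manual_accuracy, manual_precision, manual_recall, manual_f1)
print(accuracy_score(toy_true, toy_pred), precision_score(toy_true, toy_pred),
      recall_score(toy_true, toy_pred), f1_score(toy_true, toy_pred))
```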


## Summary: Fake News Prediction Model Analysis

### 1. Q&A

**Q: What is the performance of the trained model on the test dataset?**
**A:** The model achieved high accuracy (0.9898), precision (0.9882), recall (0.9905), and F1-score (0.9894) on the test dataset, indicating strong performance in distinguishing between true and fake news.

**Q: What model was used for fake news prediction?**
**A:** A feedforward neural network (FFNN) with a single hidden layer was implemented using PyTorch.

**Q: What was the validation accuracy during training?**
**A:** The model achieved a validation accuracy of around 98.95% after 10 epochs of training.


### 2. Data Analysis Key Findings

* The model achieved a high test accuracy of **0.9898**, precision of **0.9882**, recall of **0.9905**, and F1-score of **0.9894**.
* The model's validation accuracy during training reached approximately **98.95%** after 10 epochs.
* TF-IDF vectorization was used to convert text data into numerical features with a maximum of **5000** features.
* The FFNN model had a hidden layer size of **128** and used Adam optimizer with a learning rate of **0.001**.

### 3. Insights or Next Steps

* The developed model demonstrates strong performance in classifying fake news with high accuracy, precision, recall, and F1-score.
* Possible next steps include trying other model architectures, tuning hyperparameters such as the hidden layer size and learning rate, or changing the feature engineering (for example, the TF-IDF vocabulary size) to see whether performance improves further.


In [6]:
y_true = []
y_pred = []
with torch.inference_mode():
    model.eval()
    for texts, labels in test_dataloader:
        # Reshape labels to (batch, 1) floats to match the model output
        labels = labels.float().unsqueeze(1)
        # Vectorize the raw text batch with the fitted TF-IDF vectorizer
        texts_tfidf = vectorizer.transform(texts)
        texts_tfidf = torch.from_numpy(texts_tfidf.toarray()).float()

        outputs = model(texts_tfidf)
        predicted = (outputs > 0.5).float()

        y_true.extend(labels.cpu().numpy().tolist())
        y_pred.extend(predicted.cpu().numpy().tolist())

# Compute the metrics once all batches have been collected
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(f"Test Accuracy: {accuracy:.4f}")
print(f"Test Precision: {precision:.4f}")
print(f"Test Recall: {recall:.4f}")
print(f"Test F1-Score: {f1:.4f}")


Test Accuracy: 0.9898
Test Precision: 0.9882
Test Recall: 0.9905
Test F1-Score: 0.9894


In [12]:
user_input = input("Enter a news article: ")

# Preprocess the user input using the TF-IDF vectorizer
user_input_tfidf = vectorizer.transform([user_input])
user_input_tfidf = torch.from_numpy(user_input_tfidf.toarray()).float()

# Make a prediction
with torch.no_grad():  # Disable gradient calculation during inference
    model.eval()  # Set the model to evaluation mode
    prediction = model(user_input_tfidf)
    predicted_class = (prediction > 0.5).float().item()  # Get the predicted class (0 or 1)

# Display the result
if predicted_class == 0:
    print("The news is likely to be fake.")
else:
    print("The news is likely to be true.")

Enter a news article: "Local Government Announces New Park Development  The City Council of Anytown has approved plans for a new park to be built on the former industrial site at 123 Main Street. The park will feature walking trails, a playground, and a community garden. Construction is expected to begin next spring and be completed within a year. Funding for the project comes from a combination of city bonds and private donations."
The news is likely to be fake.
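
The cell above reports only the hard 0/1 decision. Printing the raw sigmoid output as well shows how confident the model is, which helps when a prediction looks surprising. The sketch below is one way to do that. `predict_probability` is a hypothetical helper, and the tiny `demo_vectorizer` and untrained `demo_model` exist only to make the example runnable; in the notebook, the fitted `vectorizer` and trained `model` would be passed instead.

```python
import torch
import torch.nn as nn
from sklearn.feature_extraction.text import TfidfVectorizer


def predict_probability(text, vectorizer, model):
    """Return the model's sigmoid output for one article, i.e. P(label == 1)."""
    feats = torch.from_numpy(vectorizer.transform([text]).toarray()).float()
    model.eval()
    with torch.no_grad():
        return model(feats).item()


# Hypothetical stand-ins so the sketch runs on its own (the model is untrained)
demo_vectorizer = TfidfVectorizer().fit(
    ["council approves park", "aliens secretly run council"]
)
demo_model = nn.Sequential(
    nn.Linear(len(demo_vectorizer.vocabulary_), 1), nn.Sigmoid()
)

p_true = predict_probability("council approves new park", demo_vectorizer, demo_model)
label = "true" if p_true > 0.5 else "fake"
print(f"P(true) = {p_true:.3f}, predicted: {label}")
```

A probability close to 0.5 indicates a borderline case, whereas a value near 0 or 1 means the model is confident in its decision.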
