<a href="https://colab.research.google.com/github/mustafiz-07/Data/blob/main/Fake_news_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Importing dependencies

In [17]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Importing dataset from kagglehub

In [18]:

import kagglehub
from kagglehub import KaggleDatasetAdapter

file_path = "fake_news_dataset.csv"

df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "mahdimashayekhi/fake-news-detection-dataset",
  file_path,
)




In [19]:
print(df.columns)

Index(['title', 'text', 'date', 'source', 'author', 'category', 'label'], dtype='object')


## Data Pre-processing

In [27]:
df.shape

(20000, 5)

In [20]:
df.head()

|   | title | text | date | source | author | category | label |
|---|---|---|---|---|---|---|---|
| 0 | Foreign Democrat final. | more tax development both store agreement lawy... | 2023-03-10 | NY Times | Paula George | Politics | real |
| 1 | To offer down resource great point. | probably guess western behind likely next inve... | 2022-05-25 | Fox News | Joseph Hill | Politics | fake |
| 2 | Himself church myself carry. | them identify forward present success risk sev... | 2022-09-01 | CNN | Julia Robinson | Business | fake |
| 3 | You unit its should. | phone which item yard Republican safe where po... | 2023-02-07 | Reuters | Mr. David Foster DDS | Science | fake |
| 4 | Billion believe employee summer how. | wonder myself fact difficult course forget exa... | 2023-04-03 | CNN | Austin Walker | Technology | fake |


Because the correlation between ***source*** and the label is very low, I drop ***source*** from the dataset. I also drop ***date***, since it does not help determine the target.
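One way to check a claim like this is a chi-squared test of independence on the contingency table of `source` against `label`. The sketch below runs on a small made-up DataFrame (the rows and the name `toy` are illustrative assumptions, not taken from the dataset) and uses `scipy.stats.chi2_contingency`. On the real data, the same calls would be applied to `df` while the `source` column is still present. A large p-value is consistent with `source` carrying little information about `label`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy rows, only to make the example self-contained
toy = pd.DataFrame({
    "source": ["CNN", "CNN", "CNN", "CNN",
               "Reuters", "Reuters", "Reuters", "Reuters",
               "Fox News", "Fox News", "Fox News", "Fox News"],
    "label":  ["real", "fake", "real", "fake",
               "fake", "real", "fake", "real",
               "real", "real", "fake", "fake"],
})

# Contingency table: one row per source, one column per label
table = pd.crosstab(toy["source"], toy["label"])

# Chi-squared test of independence between source and label
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print("chi2 =", chi2, "p-value =", p_value, "dof =", dof)
```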

In [21]:
df.drop(['source','date'],axis = 1,inplace = True)


In [23]:
df['category'].unique()

array(['Politics', 'Business', 'Science', 'Technology', 'Health',
       'Sports', 'Entertainment'], dtype=object)

In [31]:

# Filter the dataframe to only include rows where the 'label' column is 'real'
real_news_df = df[df['label'] == 'real']

# Count the occurrences of each category in the filtered dataframe
category_counts = real_news_df['category'].value_counts()

# Print the category counts
category_counts

| category | count |
|---|---|
| Technology | 1458 |
| Health | 1440 |
| Entertainment | 1429 |
| Sports | 1423 |
| Science | 1413 |
| Politics | 1403 |
| Business | 1378 |


In [61]:
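# Same count for the rows labelled 'fake'.
# This cell needs the 'category' column, so it must run before that column is dropped;
# running it afterwards raises KeyError: 'category'.
fake = df[df['label'] == 'fake']
print(fake.shape)
fake['category'].value_counts()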


(10056, 5)


KeyError: 'category'

The ***category*** feature has an almost equal number of real and fake labels for every value, so it is also not useful for detection.
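Both label counts can be tabulated in a single step with `pd.crosstab`, which avoids filtering the DataFrame once per label. A minimal sketch on made-up rows (`toy` is a hypothetical stand-in for `df`, and its values are invented for illustration):

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    "category": ["Politics", "Politics", "Sports", "Sports", "Health", "Health"],
    "label":    ["real", "fake", "real", "fake", "fake", "fake"],
})

# Raw counts of each label within each category
counts = pd.crosstab(toy["category"], toy["label"])

# Share of each label within each category (each row sums to 1)
shares = pd.crosstab(toy["category"], toy["label"], normalize="index")

print(counts)
print(shares)
```

Rows whose shares sit close to 0.5 for both labels correspond to categories that say little about the label.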

In [41]:
df.drop('category',axis = 1,inplace = True)

In [42]:
df.head()

|   | title | text | author | label |
|---|---|---|---|---|
| 0 | Foreign Democrat final. | more tax development both store agreement lawy... | Paula George | real |
| 1 | To offer down resource great point. | probably guess western behind likely next inve... | Joseph Hill | fake |
| 2 | Himself church myself carry. | them identify forward present success risk sev... | Julia Robinson | fake |
| 3 | You unit its should. | phone which item yard Republican safe where po... | Mr. David Foster DDS | fake |
| 4 | Billion believe employee summer how. | wonder myself fact difficult course forget exa... | Austin Walker | fake |


### Using Logistic Regression

In [58]:
# Replace missing values with empty strings so the columns can be concatenated
df['author'] = df['author'].fillna('')
df['title'] = df['title'].fillna('')
df['text'] = df['text'].fillna('')

# Combine author, title and text into a single text field
df['content'] = df['author'] + ' ' + df['title'] + ' ' + df['text']

# Stemming: keep letters only, lowercase, drop English stopwords, stem each word
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))  # built once instead of once per word

def stemming(content):
  review = re.sub('[^a-zA-Z]', ' ', content)
  review = review.lower()
  review = review.split()
  review = [ps.stem(word) for word in review if word not in stop_words]
  review = ' '.join(review)
  return review

df['content'] = df['content'].apply(stemming)

# Separating the data and the labels
X = df['content'].values
Y = df['label'].values

# Converting the textual data to numerical data
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)

# Splitting the data into training and test data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

# Training the model with Logistic Regression
model = LogisticRegression()
model.fit(X_train, Y_train)

# Evaluating the model with testing data; accuracy_score expects (y_true, y_pred)
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print("Test data accuracy: ", test_data_accuracy)

Test data accuracy:  0.514
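In the cell above, the `TfidfVectorizer` is fitted on every row before the train/test split, so the vocabulary and IDF weights are learned partly from the test rows. A common alternative is to split first and wrap the vectorizer and the classifier in a scikit-learn `Pipeline`, so that the vectorizer only ever sees training rows. The sketch below uses a made-up toy corpus (the texts and labels are hypothetical) and makes no claim about how the score on this dataset would change.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, invented for illustration
texts = [
    "tax plan agreed by senate", "senate vote on budget delayed",
    "team wins final match", "coach praises young players",
    "aliens endorse tax plan", "moon made of cheese says senator",
    "player scores from orbit", "coach replaced by robot overnight",
]
labels = ["real", "real", "real", "real", "fake", "fake", "fake", "fake"]

# Split the raw texts first, before any vectorizing
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=2
)

# fit() fits the vectorizer on the training texts only, then the classifier
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
pipeline.fit(X_train, y_train)

# predict() transforms the test texts with the training vocabulary
predictions = pipeline.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))
```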


### Using SVM model

In [59]:

from sklearn import svm

# Training the model with SVM
svm_model = svm.SVC()
svm_model.fit(X_train, Y_train)

# Evaluating the SVM model with testing data
X_test_prediction_svm = svm_model.predict(X_test)
test_data_accuracy_svm = accuracy_score(Y_test, X_test_prediction_svm)  # (y_true, y_pred)
print("SVM Test data accuracy:", test_data_accuracy_svm)


SVM Test data accuracy: 0.51525
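A single 80/20 split gives one accuracy figure per model. k-fold cross-validation gives several figures per model, which makes a comparison less dependent on one particular split. The sketch below shows the idea with `cross_val_score` on a made-up toy corpus (the texts, labels, and the `models` and `results` names are hypothetical, chosen for illustration).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: 6 "real" and 6 "fake" texts
texts = [
    "senate passes budget bill", "court rules on tax case",
    "team wins league title", "study links sleep and memory",
    "company reports quarterly profit", "city opens new hospital",
    "aliens sign trade deal", "moon declared a new state",
    "player scores from orbit", "scientist turns lead into gold",
    "company sells bottled sunlight", "hospital cures aging overnight",
]
labels = ["real"] * 6 + ["fake"] * 6

# Each model gets its own TF-IDF step, refitted inside every fold
models = {
    "LogisticRegression": make_pipeline(TfidfVectorizer(), LogisticRegression()),
    "SVC": make_pipeline(TfidfVectorizer(), SVC()),
}

results = {}
for name, model in models.items():
    # cv=3 uses stratified 3-fold splitting for classifiers
    scores = cross_val_score(model, texts, labels, cv=3, scoring="accuracy")
    results[name] = scores
    print(name, "mean:", scores.mean(), "std:", scores.std())
```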


Accuracy score using Logistic Regression: **0.514**


Accuracy score using SVM: **0.51525**
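To put these two scores in context, it helps to compare them with the accuracy of always predicting the more frequent label, and with the sampling noise expected on a test set of this size. The sketch below uses only figures printed earlier in the notebook (20000 rows, 10056 of them fake, `test_size = 0.2`). The standard error is a rough binomial approximation that treats test rows as independent, which is an assumption rather than something the notebook establishes.

```python
import math

# Figures reported earlier in the notebook
n_rows = 20000       # first entry of df.shape
n_fake = 10056       # first entry of fake.shape
test_size = 0.2      # fraction passed to train_test_split
acc_logreg = 0.514
acc_svm = 0.51525

# Accuracy of always predicting the more frequent label
majority_baseline = max(n_fake, n_rows - n_fake) / n_rows

# Rough standard error of an accuracy near 0.5 measured on the test split
n_test = round(n_rows * test_size)
std_error = math.sqrt(0.5 * 0.5 / n_test)

print("Majority-class baseline:", majority_baseline)
print("Approx. standard error:", round(std_error, 4))
print("Gap between the two models:", round(acc_svm - acc_logreg, 5))
```

The baseline comes out at about 0.503 and the standard error at about 0.008, so both models sit within roughly two standard errors of the baseline, and the gap between them is much smaller than one standard error.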