**About the Dataset:**

title- A short headline summarizing the article (around 6 words).

text- The body of the news article (250 words on average).

date- The publication date of the article, randomly selected over the past 3 years.

source- The media source that published the article (e.g., BBC, CNN, Al Jazeera). May contain missing values.

author- The author's full name. Some entries are missing (~5%) to simulate real-world incomplete data.

category- The general category of the article (e.g., Politics, Health, Sports, Technology).

label- The target label: real or fake news.

In [2]:
import numpy as np
import pandas as pd
import re     # regular expressions for cleaning text
from nltk.corpus import stopwords    # list of common words to filter out
from nltk.stem.porter import PorterStemmer   # reduces each word to its stem
from sklearn.feature_extraction.text import TfidfVectorizer   # converts the text into feature vectors
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [4]:
print(stopwords.words('english'))  #unnecessary words to be removed from the dataset

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

In [6]:
news_data = pd.read_csv('/content/fake_news_dataset.csv')
news_data.head()

Unnamed: 0,title,text,date,source,author,category,label
0,Foreign Democrat final.,more tax development both store agreement lawy...,2023-03-10,NY Times,Paula George,Politics,real
1,To offer down resource great point.,probably guess western behind likely next inve...,2022-05-25,Fox News,Joseph Hill,Politics,fake
2,Himself church myself carry.,them identify forward present success risk sev...,2022-09-01,CNN,Julia Robinson,Business,fake
3,You unit its should.,phone which item yard Republican safe where po...,2023-02-07,Reuters,Mr. David Foster DDS,Science,fake
4,Billion believe employee summer how.,wonder myself fact difficult course forget exa...,2023-04-03,CNN,Austin Walker,Technology,fake


In [7]:
news_data.shape
news_data.isnull().sum()

Unnamed: 0,0
title,0
text,0
date,0
source,1000
author,1000
category,0
label,0
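
As a quick sanity check, the counts above can be turned into missing-value rates. This is a minimal sketch: it hard-codes the two non-zero counts from the output and assumes the 20,000-row length that this notebook reports when it prints the content column. The variable names are illustrative.

```python
# Missing-value counts copied from the isnull().sum() output
missing_counts = {"source": 1000, "author": 1000}
n_rows = 20000  # assumption: row count taken from the Series length printed in this notebook

for column, count in missing_counts.items():
    rate = count / n_rows
    print(f"{column}: {rate:.1%} missing")
```

Both rates work out to 5%, which agrees with the figure quoted for author in the dataset description.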


In [8]:
news_data = news_data.fillna('')

In [10]:
#combine title and author into a single 'content' column, which is used to make predictions
news_data['content'] = news_data['title']+' '+ news_data['author']
print(news_data['content'])

0                     Foreign Democrat final. Paula George
1          To offer down resource great point. Joseph Hill
2              Himself church myself carry. Julia Robinson
3                You unit its should. Mr. David Foster DDS
4        Billion believe employee summer how. Austin Wa...
                               ...                        
19995                         House party born. Gary Miles
19996    Though nation people maybe price box. Maria Mc...
19997     Yet exist with experience unit. Kristen Franklin
19998                  School wide itself item. David Wise
19999        Offer chair cover senior born. James Peterson
Name: content, Length: 20000, dtype: object
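
The earlier `fillna('')` call matters for this concatenation: adding a string to a missing value yields a missing result, so any row without an author would otherwise lose its title as well. The sketch below shows this on a two-row toy frame; the frame is hypothetical (the titles are borrowed from the output above, and the missing author is made up for illustration).

```python
import pandas as pd

# Toy frame (hypothetical): the second row has no author
toy = pd.DataFrame({
    "title": ["House party born.", "School wide itself item."],
    "author": ["Gary Miles", None],
})

# Without filling, the missing author makes the whole combined value missing
without_fill = toy["title"] + " " + toy["author"]

# With an empty-string fill, the title survives (followed by a trailing space)
with_fill = toy["title"] + " " + toy["author"].fillna("")

print(without_fill.isna().tolist())
print(with_fill.tolist())
```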


In [32]:
news_data['label'] = news_data['label'].map({'real':1,'fake':0})
print(news_data['label'])

0        1
1        0
2        0
3        0
4        0
        ..
19995    0
19996    1
19997    1
19998    0
19999    0
Name: label, Length: 20000, dtype: int64


In [33]:
#separating the data and label
X = news_data.drop(columns = 'label')
Y = news_data['label']
print(X.head())
print(Y.head())

                                  title  \
0               Foreign Democrat final.   
1   To offer down resource great point.   
2          Himself church myself carry.   
3                  You unit its should.   
4  Billion believe employee summer how.   

                                                text        date    source  \
0  more tax development both store agreement lawy...  2023-03-10  NY Times   
1  probably guess western behind likely next inve...  2022-05-25  Fox News   
2  them identify forward present success risk sev...  2022-09-01       CNN   
3  phone which item yard Republican safe where po...  2023-02-07   Reuters   
4  wonder myself fact difficult course forget exa...  2023-04-03       CNN   

                 author    category  \
0          Paula George    Politics   
1           Joseph Hill    Politics   
2        Julia Robinson    Business   
3  Mr. David Foster DDS     Science   
4         Austin Walker  Technology   

                                     

**Stemming**


In [34]:
port_stem = PorterStemmer()

In [35]:
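# Build the stop-word set once; calling stopwords.words() for every token is very slow
stop_words = set(stopwords.words('english'))

def stemming(content):
  stem_cont = re.sub('[^a-zA-Z]',' ', content)   # replace non-letters with spaces
  stem_cont = stem_cont.lower()
  stem_cont = stem_cont.split()
  # drop stop words, then stem what remains
  stem_cont = [port_stem.stem(word) for word in stem_cont if word not in stop_words]
  stem_cont = ' '.join(stem_cont)
  return stem_cont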
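
To make the cleaning steps concrete, here is a simplified, dependency-free sketch of the same pipeline applied to one title from the dataset. It deliberately leaves out the Porter stemming step and uses a tiny hand-picked stop-word set rather than the full NLTK list, so `clean_text` and `TOY_STOP_WORDS` are illustrative names, not part of the notebook.

```python
import re

# Assumption: a tiny hand-picked subset standing in for NLTK's English stop words
TOY_STOP_WORDS = {"to", "down", "its", "you", "should"}

def clean_text(content):
    """Keep letters only, lowercase, split on whitespace, and drop stop words."""
    letters_only = re.sub("[^a-zA-Z]", " ", content)
    tokens = letters_only.lower().split()
    kept = [word for word in tokens if word not in TOY_STOP_WORDS]
    return " ".join(kept)

print(clean_text("To offer down resource great point."))
```

In the notebook's own function, each surviving word is additionally passed through `port_stem.stem` before the words are joined.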

In [None]:
news_data['content'] = news_data['content'].apply(stemming)
print(news_data['content'])

In [None]:
#separating the values

X = news_data['content'].values
Y = news_data['label'].values
print(X)
print(Y)

In [41]:
#converting the text into features (note: the vectorizer is fitted on all rows, before the train/test split)
vectorizer = TfidfVectorizer()
vectorizer.fit(X)
X = vectorizer.transform(X)
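
For intuition about what the vectorizer computes, the sketch below reproduces the default weighting in plain Python: raw term counts multiplied by a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. It is an approximation for illustration only. Tokens are split on whitespace, whereas `TfidfVectorizer` applies its own tokenizer, and the function name `tfidf` and the two toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) with smoothed-IDF weights, L2-normalized per row."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

# Toy documents (hypothetical): "offer" appears in both, the other terms in one each
vocab, rows = tfidf(["offer resource great point", "offer chair cover"])
print(vocab)
print(rows[1])
```

Terms that appear in many documents receive a smaller weight than terms that are specific to a few, which is why "offer" ends up with a lower value than "chair" in the second row.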


**Splitting into training and testing data**

In [44]:
X_train, X_test , Y_train , Y_test = train_test_split(X, Y, test_size = 0.2, stratify = Y, random_state = 2)
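
The `stratify = Y` argument asks for the class proportions to be preserved in both splits. The sketch below illustrates that idea in plain Python by splitting each class separately. It is not scikit-learn's actual algorithm, and `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Split indices so each class contributes about test_size of its rows to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels (hypothetical): 60 rows of class 1 and 40 rows of class 0
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx) / len(test_idx))
```

With a 60/40 class mix, the test split keeps the same 60/40 ratio, which is the property stratification is meant to guarantee.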


In [55]:


model = LogisticRegression()
model.fit(X_train, Y_train)


In [53]:
X_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(Y_train, X_train_pred)   # argument order is (y_true, y_pred)
print(train_accuracy)

0.65


In [57]:
X_test_pred = model.predict(X_test)
test_accuracy = accuracy_score(Y_test, X_test_pred)   # argument order is (y_true, y_pred)
print(test_accuracy)

0.50125
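
The gap between the training accuracy (0.65) and the test accuracy (0.50125) is easier to interpret next to a trivial baseline. The sketch below computes accuracy by hand and compares it with always predicting the most common training label. The helper names and toy labels are hypothetical, and the notebook does not show the dataset's class balance, so no baseline figure for this dataset is claimed here.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    majority_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [majority_label] * len(y_test))

# Toy labels (hypothetical, not taken from the dataset)
y_train = [1, 0, 1, 0, 1]
y_test = [1, 0, 0, 1]
y_pred = [1, 1, 0, 0]

print(accuracy(y_test, y_pred))
print(majority_baseline(y_train, y_test))
```

A model whose test accuracy does not exceed this baseline has not learned anything that generalizes, even if its training accuracy is higher.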


**Making a predictive system**

In [60]:
X_new = X_test[2]
prediction = model.predict(X_new)   # predict only the selected row
print(prediction)

# labels were encoded earlier as real -> 1, fake -> 0
if (prediction[0] == 1):
  print('The news is real')
else:
  print('The news is fake')


[1]
The news is real


In [61]:
print(Y_test[2])

0
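
Because the labels were encoded as real = 1 and fake = 0, decoding a prediction should go through the same mapping. The sketch below keeps that mapping in one place and compares a predicted id with the true one. `LABEL_NAMES` and `describe` are illustrative names, and the values passed in are the third prediction and the third true label shown in the outputs above.

```python
# Same encoding as the notebook's label mapping: real -> 1, fake -> 0
LABEL_NAMES = {1: "real", 0: "fake"}

def describe(predicted, actual):
    """Return a readable comparison of a predicted label id and the true label id."""
    verdict = "match" if predicted == actual else "mismatch"
    return f"predicted {LABEL_NAMES[predicted]}, actual {LABEL_NAMES[actual]} ({verdict})"

# Third test row: predicted id 1, true id 0
print(describe(1, 0))
```

Keeping the id-to-name mapping in a single dictionary avoids the risk of an if/else branch drifting out of sync with the encoding.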
