
About the Dataset:

id: unique id for a news article
title: the title of the news article
author: the author of the news article
text: the text of the article; may be incomplete
label: marks whether the news article is real or fake:

    1: Fake news
    0: Real news


Importing the Dependencies

In [37]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [38]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [39]:
print(stopwords.words('english'))

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Loading and Inspecting the Dataset

In [40]:
news_dataset = pd.read_csv('/content/train.csv')

In [41]:
print("Dataset Shape:", news_dataset.shape)

Dataset Shape: (20800, 5)


In [42]:
print(news_dataset.head())

   id                                              title              author  \
0   0  House Dem Aide: We Didn’t Even See Comey’s Let...       Darrell Lucus   
1   1  FLYNN: Hillary Clinton, Big Woman on Campus - ...     Daniel J. Flynn   
2   2                  Why the Truth Might Get You Fired  Consortiumnews.com   
3   3  15 Civilians Killed In Single US Airstrike Hav...     Jessica Purkiss   
4   4  Iranian woman jailed for fictional unpublished...      Howard Portnoy   

                                                text  label  
0  House Dem Aide: We Didn’t Even See Comey’s Let...      1  
1  Ever get the feeling your life circles the rou...      0  
2  Why the Truth Might Get You Fired October 29, ...      1  
3  Videos 15 Civilians Killed In Single US Airstr...      1  
4  Print \nAn Iranian woman has been sentenced to...      1  


In [43]:
print("Missing Values:\n", news_dataset.isnull().sum())

Missing Values:
 id           0
title      558
author    1957
text        39
label        0
dtype: int64


Data Cleaning & Preprocessing

In [44]:
# Replace missing values with empty strings so author and title can be concatenated as text
news_dataset.fillna('', inplace=True)

In [45]:
news_dataset['content'] = news_dataset['author'] + " " + news_dataset['title']

In [46]:
print(news_dataset['content'].head())

0    Darrell Lucus House Dem Aide: We Didn’t Even S...
1    Daniel J. Flynn FLYNN: Hillary Clinton, Big Wo...
2    Consortiumnews.com Why the Truth Might Get You...
3    Jessica Purkiss 15 Civilians Killed In Single ...
4    Howard Portnoy Iranian woman jailed for fictio...
Name: content, dtype: object


In [47]:
X = news_dataset['content'].values
Y = news_dataset['label'].values

In [48]:
print("Feature Set Sample:\n", X[:5])
print("Target Variable Sample:\n", Y[:5])

Feature Set Sample:
 ['Darrell Lucus House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'
 'Daniel J. Flynn FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'
 'Consortiumnews.com Why the Truth Might Get You Fired'
 'Jessica Purkiss 15 Civilians Killed In Single US Airstrike Have Been Identified'
 'Howard Portnoy Iranian woman jailed for fictional unpublished story about woman stoned to death for adultery']
Target Variable Sample:
 [1 0 1 1 1]


In [49]:
print("Target Shape:", Y.shape)

Target Shape: (20800,)


Text Preprocessing (Stemming)

In [50]:
port_stem = PorterStemmer()

In [51]:
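english_stopwords = set(stopwords.words('english'))  # Build the stop word set once, not once per word

def stemming(content):
    """Clean one text string and return its stemmed form without stop words."""
    stemmed_content = re.sub('[^a-zA-Z]', ' ', content)  # Replace every non-letter character with a space
    stemmed_content = stemmed_content.lower()  # Convert to lowercase
    stemmed_content = stemmed_content.split()  # Tokenize on whitespace
    # Drop stop words, then stem the remaining words
    stemmed_content = [port_stem.stem(word) for word in stemmed_content if word not in english_stopwords]
    return ' '.join(stemmed_content)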


In [52]:
news_dataset['content'] = news_dataset['content'].apply(stemming)


In [53]:
print(news_dataset['content'].head())

0    darrel lucu hous dem aid even see comey letter...
1    daniel j flynn flynn hillari clinton big woman...
2               consortiumnew com truth might get fire
3    jessica purkiss civilian kill singl us airstri...
4    howard portnoy iranian woman jail fiction unpu...
Name: content, dtype: object
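
The cleaning steps inside `stemming` can be traced on a single headline without NLTK. The sketch below is an illustration only: `TINY_STOPWORDS` is a hypothetical, hand-picked subset of the NLTK stop word list, `clean_text` is a made-up helper name, and the Porter stemming step is deliberately left out, so the result keeps full words rather than stems.

```python
import re

# Hypothetical, hand-picked subset of NLTK's English stop words (illustration only).
TINY_STOPWORDS = {"the", "in", "on", "to", "for", "about", "have", "been", "you", "why", "we", "it"}

def clean_text(content):
    """Keep letters only, lowercase, split on whitespace, and drop stop words."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", content)  # digits and punctuation become spaces
    tokens = letters_only.lower().split()
    return " ".join(tok for tok in tokens if tok not in TINY_STOPWORDS)

cleaned = clean_text("Consortiumnews.com Why the Truth Might Get You Fired")
print(cleaned)
```

Compared with row 2 of the output above (`consortiumnew com truth might get fire`), the only difference should be the missing stemming step: `consortiumnews` and `fired` are not reduced to their stems here.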


Converting Text Data into Numerical Data (TF-IDF)

In [54]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(news_dataset['content'])

In [55]:
print("Transformed Feature Matrix:\n", X)

Transformed Feature Matrix:
   (0, 3600)	0.3598939188262559
  (0, 8909)	0.3635963806326075
  (0, 7005)	0.21874169089359144
  (0, 3792)	0.2705332480845492
  (0, 267)	0.27010124977708766
  (0, 4973)	0.233316966909351
  (0, 13473)	0.2565896679337957
  (0, 2959)	0.2468450128533713
  (0, 8630)	0.29212514087043684
  (0, 7692)	0.24785219520671603
  (0, 2483)	0.3676519686797209
  (0, 15686)	0.28485063562728646
  (1, 3568)	0.26373768806048464
  (1, 5503)	0.7143299355715573
  (1, 6816)	0.1904660198296849
  (1, 2813)	0.19094574062359204
  (1, 1497)	0.2939891562094648
  (1, 16799)	0.30071745655510157
  (1, 2223)	0.3827320386859759
  (1, 1894)	0.15521974226349364
  (2, 3103)	0.46097489583229645
  (2, 2943)	0.3179886800654691
  (2, 15611)	0.41544962664721613
  (2, 9620)	0.49351492943649944
  (2, 5968)	0.3474613386728292
  :	:
  (20797, 9588)	0.17455348025522197
  (20797, 7042)	0.21799048897828685
  (20797, 3643)	0.2115550061362374
  (20797, 8364)	0.22322585870464115
  (20797, 9518)	0.295420400342031
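
Each line of the sparse output reads `(row, column)  weight`: the row is the document index, the column is the term's index in the learned vocabulary, and the weight is the TF-IDF score. As a rough sketch of how such weights arise, the code below recomputes TF-IDF by hand, assuming the `TfidfVectorizer` defaults (raw term counts, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). The helper `tfidf_rows` and the three toy documents are made up for illustration. Unlike the default tokenizer, this version splits on whitespace only and does not drop one-character tokens.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per document, L2-normalized per row."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw count times smoothed inverse document frequency.
        raw = {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
               for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows

# Toy documents (made up), already cleaned and stemmed.
docs = ["flynn flynn hillari clinton", "truth might get fire", "hillari clinton truth"]
weights = tfidf_rows(docs)
for i, row in enumerate(weights):
    for term, w in sorted(row.items()):
        print(i, term, round(w, 4))
```

A term that is repeated within a document and rare across the corpus (`flynn` in the first toy document) gets the largest weight in its row, which is the behavior TF-IDF is designed to produce.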

Splitting Data into Training & Test Sets

In [56]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, stratify=Y, random_state=2)

In [57]:
print("Training Data Shape:", X_train.shape)
print("Testing Data Shape:", X_test.shape)

Training Data Shape: (16640, 17128)
Testing Data Shape: (4160, 17128)
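
The shapes follow from the arithmetic: 20% of 20800 rows is 4160 test rows, leaving 16640 for training, and both sets share the same 17128 TF-IDF columns. The `stratify=Y` argument keeps the share of fake and real articles the same in both sets. The sketch below shows one simple way to do a stratified split by hand. It is not scikit-learn's implementation, `stratified_split` is a hypothetical helper, and the toy class counts are made up, so the selected indices will not match those of `train_test_split`.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_idx, test_idx), splitting each class in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels: 60 fake (1) and 40 real (0); these counts are made up.
labels = [1] * 60 + [0] * 40
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx) / len(test_idx))  # share of fake news in the test set
```

With these toy labels, the test set keeps the same 60/40 class balance as the full list.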


Training the Model (Logistic Regression)

In [58]:
model = LogisticRegression()

In [59]:
model.fit(X_train, Y_train)
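
Conceptually, logistic regression learns one weight per TF-IDF column plus a bias, passes the weighted sum through a sigmoid, and predicts class 1 (fake) when the resulting probability is at least 0.5. The sketch below illustrates that idea on made-up two-feature data. It substitutes plain batch gradient descent without regularization for what `LogisticRegression` does by default (an L2 penalty and the `lbfgs` solver), so it is a teaching aid rather than a reproduction of the fitted model. The names `sigmoid`, `predict_proba`, and `train` are hypothetical helpers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, features):
    """Probability of class 1 (fake) for one feature vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, features)) + bias)

def train(samples, labels, lr=0.5, epochs=500):
    """Plain batch gradient descent on the log loss (no regularization)."""
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y  # gradient of the log loss w.r.t. z
            for j in range(n_features):
                grad_w[j] += error * x[j]
            grad_b += error
        weights = [w - lr * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(samples)
    return weights, bias

# Made-up data: feature 0 is high for class 1, feature 1 is high for class 0.
samples = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.9], [0.1, 0.8]]
labels = [1, 1, 0, 0]
weights, bias = train(samples, labels)
predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in samples]
print(predictions)
```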

Model Evaluation

In [60]:
X_train_prediction = model.predict(X_train)
training_data_accuracy = accuracy_score(Y_train, X_train_prediction)
print("Training Data Accuracy:", training_data_accuracy)

Training Data Accuracy: 0.9863581730769231


In [61]:
X_test_prediction = model.predict(X_test)
test_data_accuracy = accuracy_score(Y_test, X_test_prediction)
print("Test Data Accuracy:", test_data_accuracy)

Test Data Accuracy: 0.9790865384615385
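
Accuracy is the number of correct predictions divided by the total number of predictions. The sketch below computes it by hand on a made-up toy example (`accuracy` is a hypothetical helper, not the scikit-learn function) and then converts the two scores printed above back into counts, using the set sizes 16640 and 4160 from the split.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Toy example (made up): three of four predictions match.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))

# Counts implied by the scores reported in this notebook.
train_correct = round(0.9863581730769231 * 16640)
test_correct = round(0.9790865384615385 * 4160)
print(train_correct, 16640 - train_correct)  # correct and misclassified training articles
print(test_correct, 4160 - test_correct)     # correct and misclassified test articles
```

The training and test scores differ by less than one percentage point, which suggests the model is not overfitting badly on this dataset.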


Making a Prediction

In [62]:
X_new = X_test[3]  # Take the row at index 3 of the test set as a single sample

In [63]:
prediction = model.predict(X_new)

In [64]:
if prediction[0] == 0:
    print("The news is Real")
else:
    print("The news is Fake")


The news is Real


In [65]:
print("Actual Label:", "Real" if Y_test[3] == 0 else "Fake")

Actual Label: Real
