The dataset contains real and fake news articles. Real articles (over 12,600) were collected from Reuters.com, while fake ones (over 12,600) came from unreliable websites flagged by Politifact and Wikipedia. Most articles cover politics and world news. The data is split into two CSV files: True.csv (real) and Fake.csv (fake). Each record includes the article’s title, text, subject (type), and publication date. Articles mainly span 2016–2017, and the fake news texts preserve their original punctuation and errors. Note that the files loaded in this notebook are larger than those figures suggest: 21,417 rows in True.csv and 23,481 rows in Fake.csv.

In [15]:
import pandas as pd
import numpy as np
import re
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

**About libs**

**re (Regular Expressions module in Python)**
   * Used for text preprocessing, such as:
     * Removing unwanted characters, punctuation, and numbers.
     * Extracting specific patterns (like emails, hashtags, URLs).
     * Replacing sequences of characters (e.g., multiple spaces → single space).
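
As a rough illustration of these three uses, the sketch below cleans a made-up string with the standard `re` module. The sample text and the helper name `clean_text` are assumptions for demonstration only; they are not part of this notebook's pipeline.

```python
import re

def clean_text(text):
    """Hypothetical helper: strip URLs, keep letters only, collapse whitespace."""
    text = re.sub(r'https?://\S+', ' ', text)  # remove URLs
    text = re.sub(r'[^a-zA-Z ]', ' ', text)    # replace anything that is not a letter or space
    text = re.sub(r'\s+', ' ', text)           # collapse runs of whitespace into one space
    return text.strip()

# Made-up example string (assumption)
sample = "BREAKING!!!  Read more at https://example.com/story?id=42 #news"

hashtags = re.findall(r'#\w+', sample)  # extract a specific pattern
cleaned = clean_text(sample)

print(hashtags)
print(cleaned)
```

The order of the substitutions matters: removing URLs first prevents their letters from surviving the letters-only filter as noise tokens.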

**nltk.corpus.stopwords**
   * From NLTK (Natural Language Toolkit).
   * Provides a predefined list of stopwords (common words like "the", "is", "and") in many languages.
   * Used to remove non-informative words during text cleaning, since they don’t add much meaning in NLP tasks.

**nltk.stem.porter.PorterStemmer**

   * A stemming algorithm from NLTK.
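   * Reduces words to their stem, which is not always a real word (e.g., "embarrassing" → "embarrass").
   * Helps treat inflected forms of a word as the same token (e.g., "running" and "runs" → "run"). Irregular or derived forms such as "ran" or "runner" are not merged by the Porter algorithm.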

**sklearn.feature_extraction.text.TfidfVectorizer**

   * Converts a collection of text documents into a numerical matrix using TF-IDF (Term Frequency – Inverse Document Frequency).
   * Purpose: represent text in a way that emphasizes important words and downplays very frequent but less meaningful ones.

   Source: ChatGPT

In [16]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [17]:
print(stopwords.words('english'))  # common words that carry little information for classification

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been', 'before', 'being', 'below', 'between', 'both', 'but', 'by', 'can', 'couldn', "couldn't", 'd', 'did', 'didn', "didn't", 'do', 'does', 'doesn', "doesn't", 'doing', 'don', "don't", 'down', 'during', 'each', 'few', 'for', 'from', 'further', 'had', 'hadn', "hadn't", 'has', 'hasn', "hasn't", 'have', 'haven', "haven't", 'having', 'he', "he'd", "he'll", 'her', 'here', 'hers', 'herself', "he's", 'him', 'himself', 'his', 'how', 'i', "i'd", 'if', "i'll", "i'm", 'in', 'into', 'is', 'isn', "isn't", 'it', "it'd", "it'll", "it's", 'its', 'itself', "i've", 'just', 'll', 'm', 'ma', 'me', 'mightn', "mightn't", 'more', 'most', 'mustn', "mustn't", 'my', 'myself', 'needn', "needn't", 'no', 'nor', 'not', 'now', 'o', 'of', 'off', 'on', 'once', 'only', 'or', 'other', 'our', 'ours', 'ourselves', 'out', 'over', 'own', 're', 's', 'same', 'shan', "shan't", 'she

Data Preprocessing

In [19]:
df_fake=pd.read_csv('Fake.csv')
df_true=pd.read_csv('True.csv')
df_fake['label']=0
df_true['label']=1
news_df = pd.concat([df_fake, df_true], axis =0 )
news_df.head(10)

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",0
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",0
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",0
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",0
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",0
5,Racist Alabama Cops Brutalize Black Boy While...,The number of cases of cops brutalizing and ki...,News,"December 25, 2017",0
6,"Fresh Off The Golf Course, Trump Lashes Out A...",Donald Trump spent a good portion of his day a...,News,"December 23, 2017",0
7,Trump Said Some INSANELY Racist Stuff Inside ...,In the wake of yet another court decision that...,News,"December 23, 2017",0
8,Former CIA Director Slams Trump Over UN Bully...,Many people have raised the alarm regarding th...,News,"December 22, 2017",0
9,WATCH: Brand-New Pro-Trump Ad Features So Muc...,Just when you might have thought we d get a br...,News,"December 21, 2017",0


In [20]:
news_df.shape

(44898, 5)

In [21]:
news_df['label'].value_counts()

label,count
0,23481
1,21417


In [22]:
news_df.isnull().sum()

Unnamed: 0,0
title,0
text,0
subject,0
date,0
label,0


No missing values in any column.

In [23]:
#merging title and text
news_df['content']=news_df['title']+' '+news_df['text']
news_df=news_df.drop(['title','text','subject','date'],axis=1)
news_df.head()

Unnamed: 0,label,content
0,0,Donald Trump Sends Out Embarrassing New Year’...
1,0,Drunk Bragging Trump Staffer Started Russian ...
2,0,Sheriff David Clarke Becomes An Internet Joke...
3,0,Trump Is So Obsessed He Even Has Obama’s Name...
4,0,Pope Francis Just Called Out Donald Trump Dur...


Separating Data

In [24]:
X=news_df.drop(columns='label') # `columns=` already selects the axis, so `axis=1` is redundant
Y=news_df['label']

In [25]:
print(X,X.shape)
print(Y,Y.shape)


                                                 content
0       Donald Trump Sends Out Embarrassing New Year’...
1       Drunk Bragging Trump Staffer Started Russian ...
2       Sheriff David Clarke Becomes An Internet Joke...
3       Trump Is So Obsessed He Even Has Obama’s Name...
4       Pope Francis Just Called Out Donald Trump Dur...
...                                                  ...
21412  'Fully committed' NATO backs new U.S. approach...
21413  LexisNexis withdrew two products from Chinese ...
21414  Minsk cultural hub becomes haven from authorit...
21415  Vatican upbeat on possibility of Pope Francis ...
21416  Indonesia to buy $1.14 billion worth of Russia...

[44898 rows x 1 columns] (44898, 1)
0        0
1        0
2        0
3        0
4        0
        ..
21412    1
21413    1
21414    1
21415    1
21416    1
Name: label, Length: 44898, dtype: int64 (44898,)


Stemming:
  Stemming is the process of reducing a word to its root form (stem).
  e.g. acting, acted, acts --> act

In [26]:
port_stem=PorterStemmer()

Be careful: use `^` (caret) as the first character inside the brackets to mean "NOT these characters".  
Using `˄` (a similar-looking symbol) turns the class into a list of characters to match, so the regex removes letters instead of symbols.
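
The difference can be checked on a tiny string. The sketch below is only an illustration; the sample string and the variable names `correct` and `broken` are assumptions, not part of the notebook.

```python
import re

sample = "Hi, 42!"

# Correct: '^' as the first character negates the class, so every character
# that is NOT a letter or a space is replaced by a space.
correct = re.sub('[^a-zA-Z ]', ' ', sample)

# Wrong: '˄' (U+02C4) is an ordinary character, so the class now MATCHES
# letters and spaces, and those are replaced instead.
broken = re.sub('[˄a-zA-Z ]', ' ', sample)

print(repr(correct))
print(repr(broken))
```

With the real caret the letters survive and the punctuation and digits are blanked; with the look-alike the letters are blanked and the punctuation and digits survive.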


In [27]:
stop_words=set(stopwords.words('english')) # build the stopword set once; calling stopwords.words() for every word is very slow

def stemming(content):
  stemmed_content=re.sub('[^a-zA-Z ]',' ',content) # keep only letters and spaces; replace everything else (digits, punctuation, ...) with a space
  stemmed_content=stemmed_content.lower() # convert all letters to lower case
  stemmed_content=stemmed_content.split() # split the text into a list of words
  stemmed_content=[port_stem.stem(word) for word in stemmed_content if word not in stop_words] # drop stopwords, then stem the remaining words
  stemmed_content=' '.join(stemmed_content) # join the stems back into a single string
  return stemmed_content

In [28]:
news_df['content']=news_df['content'].apply(stemming)

In [29]:
print(news_df['content'])

0        donald trump send embarrass new year eve messa...
1        drunk brag trump staffer start russian collus ...
2        sheriff david clark becom internet joke threat...
3        trump obsess even obama name code websit imag ...
4        pope franci call donald trump christma speech ...
                               ...                        
21412    fulli commit nato back new u approach afghanis...
21413    lexisnexi withdrew two product chines market l...
21414    minsk cultur hub becom author minsk reuter sha...
21415    vatican upbeat possibl pope franci visit russi...
21416    indonesia buy billion worth russian jet jakart...
Name: content, Length: 44898, dtype: object


In [30]:
X=news_df['content']
Y=news_df['label']

In [31]:
print(X)
print(Y)

0        donald trump send embarrass new year eve messa...
1        drunk brag trump staffer start russian collus ...
2        sheriff david clark becom internet joke threat...
3        trump obsess even obama name code websit imag ...
4        pope franci call donald trump christma speech ...
                               ...                        
21412    fulli commit nato back new u approach afghanis...
21413    lexisnexi withdrew two product chines market l...
21414    minsk cultur hub becom author minsk reuter sha...
21415    vatican upbeat possibl pope franci visit russi...
21416    indonesia buy billion worth russian jet jakart...
Name: content, Length: 44898, dtype: object
0        0
1        0
2        0
3        0
4        0
        ..
21412    1
21413    1
21414    1
21415    1
21416    1
Name: label, Length: 44898, dtype: int64


In [32]:
Y.shape

(44898,)

In [34]:
X.shape

(44898,)

 TF–IDF (Term Frequency – Inverse Document Frequency) converts text into numbers:
- TF measures how often a word appears in a document.
- IDF reduces the weight of very common words across all documents.
- The product (TF × IDF) gives each word a score: high if it is frequent in a document but rare overall.
- This allows machine learning models to focus on the most informative words instead of common ones like "the" or "is".
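
To make the scoring concrete, here is a minimal sketch that computes TF × IDF by hand on three made-up, already-stemmed documents. It uses the textbook formulas (TF = count / document length, IDF = ln(N / df)), which is an assumption for illustration: scikit-learn's `TfidfVectorizer` by default uses raw counts for TF, a smoothed IDF of ln((1 + N) / (1 + df)) + 1, and L2-normalizes each row, so its numbers will differ. The documents and the helper name `tfidf` are hypothetical.

```python
import math
from collections import Counter

# Made-up, already-stemmed toy documents (assumption)
docs = [
    "trump trump senat vote",
    "senat vote budget",
    "pope visit russia",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF for one tokenized document."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

scores = tfidf(tokenized[0])
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In the first document, "trump" appears twice and occurs in no other document, so it scores highest; "senat" and "vote" each appear once and are shared with the second document, so they score lower.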


In [35]:
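# convert the text into TF-IDF feature vectors
# note: fitting on all rows before the split lets the vocabulary and IDF weights see the test articles;
# fitting on the training split only would avoid that
vectorizer=TfidfVectorizer()
X=vectorizer.fit_transform(X) # same result as fit(X) followed by transform(X)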

In [36]:
print(X)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6904081 stored elements and shape (44898, 89868)>
  Coords	Values
  (0, 473)	0.029901721489513068
  (0, 1739)	0.05138826757528657
  (0, 1749)	0.08889530050540473
  (0, 2170)	0.02325325572885059
  (0, 2301)	0.015138822832065134
  (0, 2416)	0.030776265805381363
  (0, 2562)	0.0649607608565772
  (0, 2575)	0.036268116701632415
  (0, 2917)	0.03962726610765806
  (0, 3031)	0.037214846319311164
  (0, 3148)	0.04785142759816799
  (0, 3446)	0.0378234764133366
  (0, 5450)	0.04443233203028609
  (0, 6676)	0.023831801439703443
  (0, 7870)	0.05339370423539277
  (0, 9454)	0.028369643121826767
  (0, 10508)	0.06187015374653721
  (0, 11108)	0.06851913882817916
  (0, 11109)	0.07878871172088633
  (0, 12611)	0.04579308567051352
  (0, 13752)	0.02855422985589162
  (0, 13991)	0.03126141595114052
  (0, 14709)	0.02556349746054079
  (0, 15149)	0.024568186836617927
  (0, 15392)	0.024752264230251966
  :	:
  (44897, 72790)	0.14107018657201614
  (44897, 7348

In [37]:
X.shape

(44898, 89868)

Splitting the dataset into training and test data

In [38]:
X_train,X_test,Y_train,Y_test=train_test_split(X,Y,test_size=0.2,stratify=Y,random_state=2)

In [39]:
print(X.shape,X_train.shape,X_test.shape)

(44898, 89868) (35918, 89868) (8980, 89868)


In [40]:
print(Y.shape,Y_train.shape,Y_test.shape)

(44898,) (35918,) (8980,)


Training the model

In [41]:
lg_model=LogisticRegression()

In [42]:
lg_model.fit(X_train,Y_train)

Evaluation

In [50]:
X_train_prediction=lg_model.predict(X_train)
training_data_accuracy=accuracy_score(Y_train,X_train_prediction) # accuracy_score expects (y_true, y_pred)
print('Accuracy on training data: ',training_data_accuracy) # high accuracy is plausible here: a large dataset and a binary classification problem, where logistic regression on TF-IDF features tends to perform well

Accuracy on training data:  0.9912857063310875


In [51]:
X_test_prediction=lg_model.predict(X_test)
test_data_accuracy=accuracy_score(Y_test,X_test_prediction) # accuracy_score expects (y_true, y_pred)
print('Accuracy on test data: ',test_data_accuracy) # close to the training accuracy, so there is little sign of overfitting

Accuracy on test data:  0.9858574610244989


Making a predictive system


In [52]:
X_new=X_test[0]
prediction=lg_model.predict(X_new)
print(prediction)
if (prediction[0]==1): # label 1 = real (True.csv), label 0 = fake (Fake.csv)
  print('The news is Real')
else:
  print('The news is Fake')

[1]
The news is Real
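
The cell above classifies a row that is already vectorized. To classify new raw text, the text has to go through the same preprocessing and then through the already fitted vectorizer's `transform` (not `fit`) before calling `predict`. The sketch below shows that flow on a tiny made-up corpus so that it runs on its own; the training texts and the names `toy_vectorizer`, `toy_model`, and `predict_label` are assumptions for illustration. In this notebook the equivalent would presumably reuse `stemming`, `vectorizer`, and `lg_model`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up, already-stemmed training texts (assumption); 0 = fake, 1 = real
train_texts = [
    "shock secret hoax expos",
    "unbeliev hoax shock claim",
    "senat vote budget reuter",
    "minist meet trade reuter",
]
train_labels = [0, 0, 1, 1]

toy_vectorizer = TfidfVectorizer()
X_toy = toy_vectorizer.fit_transform(train_texts)  # fit on training text only
toy_model = LogisticRegression().fit(X_toy, train_labels)

def predict_label(text):
    """Hypothetical helper: vectorize one preprocessed text and classify it."""
    features = toy_vectorizer.transform([text])  # transform only, never re-fit
    return int(toy_model.predict(features)[0])

label = predict_label("senat budget vote reuter")
print('The news is Real' if label == 1 else 'The news is Fake')
```

Wrapping the vectorizer and the model in one function keeps the two steps from drifting apart, since a model can only interpret feature columns produced by the vectorizer it was trained with.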
